From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E1155EEE266
	for <linux-mm@archiver.kernel.org>; Thu, 12 Sep 2024 21:42:01 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 46CDD6B0083; Thu, 12 Sep 2024 17:42:01 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 41CA16B0088; Thu, 12 Sep 2024 17:42:01 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 296646B0089; Thu, 12 Sep 2024 17:42:01 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12])
	by kanga.kvack.org (Postfix) with ESMTP id 0185F6B0083
	for <linux-mm@kvack.org>; Thu, 12 Sep 2024 17:42:00 -0400 (EDT)
Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id 93F0B1208BE
	for <linux-mm@kvack.org>; Thu, 12 Sep 2024 21:42:00 +0000 (UTC)
X-FDA: 82557409200.15.EFF783A
Received: from mail-ej1-f41.google.com (mail-ej1-f41.google.com [209.85.218.41])
	by imf04.hostedemail.com (Postfix) with ESMTP id 955B740008
	for <linux-mm@kvack.org>; Thu, 12 Sep 2024 21:41:58 +0000 (UTC)
Authentication-Results: imf04.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=EeCPX3KJ;
	spf=pass (imf04.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.41 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1726177290;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=rTZv2DvHUPeuuWtbWXNvhL20hGAkO1mn69EjhOlUabk=;
	b=7xpwJWw/h3gq+23mZ7J/cgX9ZZ2kJF4q1BnaOhdZB54i37cm2YMbGXf53TefDmLAvKFRl6
	wj0MnNcS179vzNY8dW1uPAVQueJ0m2Yc0sJ3nTkhZ8aCnOy1AXFvX1PguaEB8kZrXitQuR
	jegWFp+61YZAt9uOK2UwEr5CSEgMd0c=
ARC-Authentication-Results: i=1;
	imf04.hostedemail.com;
	dkim=pass header.d=google.com header.s=20230601 header.b=EeCPX3KJ;
	spf=pass (imf04.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.41 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1726177290; a=rsa-sha256;
	cv=none;
	b=zZesv49Uo2D/0EubHUgYdcEXJHnVrt40iESUgZaRD3TS8yTrDOoAmHgj4uJ5f3Lv1yKyz7
	tzGO7GwLVfPqgdi24gySu0Mx4bw7UrP6SP2h8aZdKRg4AOoxsY9tqBesVhJfoqd9SZtpX0
	kzkCpyu01mbJCifzVykTWMRsc2e8NWw=
Received: by mail-ej1-f41.google.com with SMTP id a640c23a62f3a-a7aa086b077so179317766b.0
        for <linux-mm@kvack.org>; Thu, 12 Sep 2024 14:41:58 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1726177317; x=1726782117; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=rTZv2DvHUPeuuWtbWXNvhL20hGAkO1mn69EjhOlUabk=;
        b=EeCPX3KJh3cj/42J4RNHztZdKJfX6C7QxLRaxJE44SE2L4g3fqRZKrbHih+vdUbvx8
         353fkR6EDyrip2rrNT8TrRnDpERLHpi4txO7yt+MrX9mPAzcXIBWpDZXMiIZHvA0+RAa
         wJF9i/jaQVXPSs66RMa7g/VJSGvHWY/83sHnneKAm0LibvHrtBQulQCzl28s0pC7s5hG
         zICtrLvLKi+pgxpD/tJOEEAva5zBCigSGpe23tIvmGtWqmzC4mJSyLJ/6oY5Na0LCVgA
         hBDtm6RCgIJCXMCmOHiXU0QjqBp4Zt9InClDLY0LYTOJ5AIJb3FSxFgDnAE/Trm+7oir
         RhPA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1726177317; x=1726782117;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=rTZv2DvHUPeuuWtbWXNvhL20hGAkO1mn69EjhOlUabk=;
        b=tqKVmUwfIEiBviqdC7m+S1h17SISA+8J1KV3ZQk3+WR/4mNgEMThzUNxcJVW8JjtAM
         EZoezW4e3dFzu3y6IUGfNbucgvPerChsuVzPcoY0ls3cluqtmkpb6eqxM5yqX66MIicp
         r00B1/IHgTIZtBTbTDZUekz4hle2e7zDgctAcRqtG1uZFGQyIMfHTQkBmLcYjjUplZfc
         xYXsyG5IZuoBBKnbFsS//5TCoANR4DqwIIG1of1/e31lfXo63z7po1xFHHT6H72pd0WS
         DKH4QMk5+SBX4vjntI6vOizV1bvpjwKssEusLTN58sNjgTQ3IS3sXBlbib7Ulp9KEjZP
         30nQ==
X-Forwarded-Encrypted: i=1; AJvYcCUDHUxYXe5C3NHOYlv02i2JCpxmVk9s9Cdo/NacO8V6rON4Yb4Y+GnnOJZwQQyerG0hpKh6CO4blw==@kvack.org
X-Gm-Message-State: AOJu0YwgxkQ1Kis+dOQNDAjZStdzGf9AnzgR16e65PLbwHMGzSR6pUBn
	twVtbi59wLp1jr+Il3YtxPanaN/U0uYa8USd+dwi6olAivGIOs09eeKkf0uGYDrr2b42b8HvHgY
	ZC2wsHTcBZ9WEPm6BkxZ+0dHUizqJrUZm19RM
X-Google-Smtp-Source: AGHT+IFJ6L/EjAQhvjUs3LCLWwrLAXw7wzIe3wRS5fwMYlDRY4tVJbtr+7tjdNlnrYIEmTliMOCPryTOB0VfVwINmm4=
X-Received: by 2002:a17:907:f168:b0:a8f:f799:e7db with SMTP id
 a640c23a62f3a-a9029438211mr405542366b.16.1726177315834; Thu, 12 Sep 2024
 14:41:55 -0700 (PDT)
MIME-Version: 1.0
References: <172616070094.2055617.17676042522679701515.stgit@firesoul>
In-Reply-To: <172616070094.2055617.17676042522679701515.stgit@firesoul>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 12 Sep 2024 14:41:18 -0700
Message-ID: <CAJD7tkZyKHRHJue1NaNqTMo5Ue2BvZfW_GXetMb+M+OkiCoj8g@mail.gmail.com>
Subject: Re: [PATCH V11] cgroup/rstat: Avoid flushing if there is an ongoing
 root flush
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, 
	hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, 
	kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspam-User: 
X-Stat-Signature: 9ot8p7zoi3ozgme4cmkgcwpn7n8mba8j
X-Rspamd-Queue-Id: 955B740008
X-Rspamd-Server: rspam11
X-HE-Tag: 1726177318-559611
X-HE-Meta: U2FsdGVkX1+dP18sT1mVFjKEXwU5uzhThO6SDA+OeeO8l3HiQPCpks2C3MOgvPjkITsxLuVy/TniC1O0zpN3T3QbL0JJZAAI33UVPptPWwf4Ayrb/xY3/SXcN4hYJUDQoXvDmzmVF6noXFgDZq0h5CqtueLC8VtTYO76WprxmtsTTeHm5jck145yitNaxEprB1HO21I9TS0LDWbDGr4b+abBPTJpjG7Zf+pHTyHBruIXl4dqdAlxB7cGglDSq/52hapAmeMn/+iO4oflsAnM36y73tw046fNgd2xswqWEHR3SWV/GYy62OKj6Y4arqYYY2Khflsoz4X8X85H6Od9euoYSoVXxaWxUNAeb9MBDGj0teYwZtxR6T87GY8q62m9yUnXPqghlsNdCCFWN4x4PnQTgUEMl92+Tb1DYJ1Amhx9fCLp4CN+UPNSCh6QK4YbOzI8msYeitEJWIz9Ujblmv7EJUhOSDMqR+l4ZLX/JJevubveo/hlqs7DHgNiJqQpTj7XXrP/qhyfY0wNM3xRvbSDle7IkG9W8793NNcG0/LdjW33jK27bpfhfglcy7TLO2abU/wVSHOqO7aw0qyZjZrgr8fIAT29/VZoLVL+hC8/AFaMsz9I+/zdkrQiLFJY11fejBw1H1Pt3kdeFvwTP+dMQ7zw8N7yi2KyU5YQPq0b+fRpw39/uPaOhKm408NspTEeAQ1vL0JrxTpEV8XgjxoJrhB2Y+2+8WGNJ2+R5SDnWTYtIAD0qdn6iVCvul4y+1NvGUXyOGPXnXVHJV/931n9vo0370ORCSXvAMCuj6C0j/wytJTDI8DuD/sCc3AvNxXoQqtIOcaKsv+lkFHN6zS+2L83CN/RmqQMm0nf8gszGUP+7qSrrLIbK0V/IYhE9w8Lyk12sO/nlB37CYejwu0o8qVaSPXq0t152gYOn3WuMVrxcmGRQkNQR0OpTh0aZ5ckqdBiYUj16v0a9fd
 Mfy6dxWe
 mUUc00iicLIPGBwzh+QyaH9KEboGXiTBvOi/6o1+oVcf7jqx1CYBdHxacTmG4+pwhfy+pHwedfjwgNnF1Tu9PzRN+hAaUcgF1L3H8kOAVVA+zTOGGuGXER8MpP2f+H0kts9w8LWS3Py1rSLqM5xiCnkB77rYdd8jqdJTakFkj/GutolwAzlOJsfggK447Nj4okoY13pPBkBZ1kBC6vGqFHxCgcyxx4QUkyirO
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000001, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Sep 12, 2024 at 10:07=E2=80=AFAM Jesper Dangaard Brouer <hawk@kerne=
l.org> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept
> to avoid redundant flushes if there is an ongoing flush at cgroup root
> level, addressing production lock contention issues on the global cgroup
> rstat lock.
>
> In this revision userspace readers will wait for the ongoing flusher to
> complete before returning, to avoid reading out-dated stats just before
> they get updated. Generally in-kernel users will attempt to skip the
> flush in-order to get out of the lock contention state. Some in-kernel
> users of the cgroup_rstat_flush() API depend on waiting for the flush to
> complete before continuing. This patch introduce the call
> cgroup_rstat_flush_relaxed() with a wait_for_flush option to satisfy both
> use-cases.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, w=
e
> noted severe lock contention on the rstat lock, causing 12 CPUs to waste
> cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedst=
at)
> for kthreads revealed that we are burning an average of 20,000 CPU cores
> fleet-wide on kswapd, primarily due to spinning on the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN thre=
ads
>   to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering t=
he
>   cgroup rstat flush operation as part of its work.
> - balance_pgdat() has a NULL value in target_mem_cgroup, causing
>   mem_cgroup_flush_stats() to flush with root_mem_cgroup.
>
> The kernel previously addressed this with a "stats_flush_ongoing" concept=
,
> which was removed in commit 7d7ef0a4686a ("mm: memcg: restore subtree sta=
ts
> flushing"). This patch reintroduces and generalizes the concept to apply =
to
> all users of cgroup rstat, not just memcg.
>
> It have been a general theme to replace mem_cgroup_flush_stats() with
> mem_cgroup_flush_stats_ratelimited every time we see a new case of this
> issue. This will hide the contention issue until something starves the
> kthread that does the periodic 2 second flush (for 2 periods). In
> production we are seeing kthreads getting starved longer than 20 seconds.
> This often happens in connection with OOM killer. This recreates the
> kswapd lock contention situation at a very unfortunate point in time.
> Thus, it makes sense to have this ongoing flusher lock contention
> protection in place.
>
> In this patch only a root cgroup can become the ongoing flusher, as this =
solves
> the production issue. Letting other levels becoming ongoing flusher cause=
 root
> cgroup to contend on the lock again.
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---

To reiterate my response on v10, I would prefer that this problem be
handled at the reclaim flushing site. This started as a nice, simple,
and generic approach, but ended up being complex and tailored to handle
the reclaim flushing case.

I do have some comments on the current implementation nonetheless.

> v11:
>  - Address Yosry request to wait-for-flush for userspace readers
>
> V10: https://lore.kernel.org/all/172547884995.206112.808619042206173396.s=
tgit@firesoul/
>
>  block/blk-cgroup.c         |    2 -
>  include/linux/cgroup.h     |    1
>  include/linux/memcontrol.h |    1
>  kernel/cgroup/rstat.c      |  133 ++++++++++++++++++++++++++++++++++++++=
++++--
>  mm/memcontrol.c            |   40 +++++++++----
>  mm/vmscan.c                |    2 -
>  mm/zswap.c                 |    2 -
>  7 files changed, 157 insertions(+), 24 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 37e6cc91d576..058393e7665a 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1200,7 +1200,7 @@ static int blkcg_print_stat(struct seq_file *sf, vo=
id *v)
>         if (!seq_css(sf)->parent)
>                 blkcg_fill_root_iostats();
>         else
> -               cgroup_rstat_flush(blkcg->css.cgroup);
> +               cgroup_rstat_flush_relaxed(blkcg->css.cgroup, true);
>
>         rcu_read_lock();
>         hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 2150ca60394b..ff65bc100ca5 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -691,6 +691,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cp=
u);
>  void cgroup_rstat_flush(struct cgroup *cgrp);
>  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
>  void cgroup_rstat_flush_release(struct cgroup *cgrp);
> +int cgroup_rstat_flush_relaxed(struct cgroup *cgrp, bool wait_for_flush)=
;

We now have 4 different flavors of rstat flushing:

1. cgroup_rstat_flush() -> normal flush: lock, flush, unlock
2. cgroup_rstat_flush_hold() -> same, but keep the lock held
3. cgroup_rstat_flush_relaxed(wait_for_flush=3Dtrue) -> if someone is
already flushing us, wait, otherwise normal
4. cgroup_rstat_flush_relaxed(wait_for_flush=3Dfalse) -> if someone is
already flushing us, skip, otherwise normal

Why do we need the third one? Can't we just keep using (1) for now for
userspace readers?

>
>  /*
>   * Basic resource stats.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 030d34e9d117..7e24c5e1327f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1026,6 +1026,7 @@ unsigned long lruvec_page_state_local(struct lruvec=
 *lruvec,
>                                       enum node_stat_item idx);
>
>  void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> +void mem_cgroup_flush_stats_relaxed(struct mem_cgroup *memcg);
>  void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
>
>  void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index a06b45272411..80a4b949138f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -11,6 +11,9 @@
>
>  static DEFINE_SPINLOCK(cgroup_rstat_lock);
>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static struct cgroup *cgrp_rstat_ongoing_flusher =3D NULL;
> +static struct task_struct *cgrp_rstat_ongoing_flusher_ID =3D NULL;

rstat_ongoing_flush_cgrp and rstat_ongoing_flush_task are probably
clearer names.

> +static DEFINE_MUTEX(cgrp_rstat_ongoing_flusher_serialize);
>
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>
> @@ -299,6 +302,68 @@ static inline void __cgroup_rstat_unlock(struct cgro=
up *cgrp, int cpu_in_loop)
>         spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> +       return cgroup_parent(cgrp) =3D=3D NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for on ongoing flu=
sher
> + * @cgrp: target cgroup
> + * @strict: always lock and ignore/skip ongoing flusher checks
> + *
> + * Function return value follow trylock semantics. Returning true when l=
ock is
> + * obtained. Returning false when not locked and it detected flushing ca=
n be
> + * skipped as another ongoing flusher is taking care of the flush.
> + *
> + * For callers that depend on flush completing before returning a strict=
 option
> + * is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool stric=
t)
> +{
> +       struct cgroup *ongoing;
> +
> +       if (strict)
> +               goto lock;

Might as well just have a cgroup_rstat_lock_flusher() function instead
of the boolean parameter, which is also consistent with the
lock/trylock semantics you are following.
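
To illustrate the split, here is a rough userspace model (not kernel
code, and untested against the patch). All names in it are made up for
the example: struct node stands in for struct cgroup, a C11 atomic_flag
stands in for cgroup_rstat_lock, and the owner-task tracking from the
patch is left out to keep it short.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup: only the parent link matters. */
struct node {
	struct node *parent;
};

static atomic_flag rstat_lock = ATOMIC_FLAG_INIT;	/* models cgroup_rstat_lock */
static _Atomic(struct node *) ongoing_flusher;		/* NULL when no root flush */

/* Walks n's ancestors and compares pointers; never dereferences 'ancestor'. */
static bool is_descendant(const struct node *n, const struct node *ancestor)
{
	for (; n; n = n->parent)
		if (n == ancestor)
			return true;
	return false;
}

static void rstat_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&rstat_lock))
		;	/* spin until the lock is free */
}

/* Unconditional variant: always takes the lock, ignores the ongoing flusher. */
static void lock_flusher(struct node *n)
{
	(void)n;
	rstat_lock_acquire();
}

/*
 * Trylock variant: returns false (lock NOT taken) when an ongoing root
 * flusher already covers n; otherwise takes the lock and returns true.
 */
static bool trylock_flusher(struct node *n)
{
	struct node *ongoing = atomic_load(&ongoing_flusher);

	if (ongoing && is_descendant(n, ongoing))
		return false;

	/* Only a root may become the ongoing flusher. */
	if (!ongoing && !n->parent) {
		struct node *expected = NULL;

		if (!atomic_compare_exchange_strong(&ongoing_flusher, &expected, n) &&
		    is_descendant(n, expected))
			return false;	/* lost the race to a flusher that covers us */
	}

	rstat_lock_acquire();
	return true;
}

/* Clears the ongoing flusher if n is the one recorded, then drops the lock. */
static void unlock_flusher(struct node *n)
{
	struct node *expected = n;

	atomic_compare_exchange_strong(&ongoing_flusher, &expected, NULL);
	atomic_flag_clear(&rstat_lock);
}
```

With this shape the strict callers simply call lock_flusher() and the
relaxed ones check the return value of trylock_flusher(), so no caller
has to pass a bool whose meaning is only clear from the kernel-doc.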

> +
> +       /*
> +        * Check if ongoing flusher is already taking care of this.  Desc=
endant
> +        * check is necessary due to cgroup v1 supporting multiple root's=
.
> +        */
> +       ongoing =3D READ_ONCE(cgrp_rstat_ongoing_flusher);
> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing))

The ongoing flusher may be going away, so cgroup_is_descendant() may
be a UAF as I pointed out a few times. I actually think even taking a
ref here may not work as it may be a flush from cgroup_rstat_exit().

One thing we can do is get the root of cgrp (probably
cgrp->root->cgrp?) and compare it to cgrp_rstat_ongoing_flusher
without ever dereferencing it. If the ongoing flusher is in fact the
root of cgrp, this implies a ref on it anyway so we can dereference it
if needed.

That would obviously need a comment to explain it.
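
Roughly what I have in mind, as a userspace model rather than kernel
code. The names (struct hier_node, its root field, ongoing_flush_root)
are invented for the example, and I am assuming that a live node's
pointer to its hierarchy root stays valid for as long as the node does:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup that can reach its hierarchy root. */
struct hier_node {
	struct hier_node *root;	/* assumed stable while this node is alive */
};

/* Root currently being flushed, or NULL. May point at a dying object. */
static _Atomic(struct hier_node *) ongoing_flush_root;

/*
 * Returns true if an ongoing root flush already covers n.
 *
 * 'ongoing' is only compared, never dereferenced, so it does not matter
 * whether the object it points to is going away. If it does equal
 * n->root, then n itself keeps that root alive and a dereference would
 * be safe from that point on.
 */
static bool covered_by_ongoing_flush(const struct hier_node *n)
{
	const struct hier_node *ongoing = atomic_load(&ongoing_flush_root);

	return ongoing != NULL && ongoing == n->root;
}
```

This also keeps the cgroup v1 multiple-roots case working, since a node
in a different hierarchy has a different root pointer and will not match.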

> +               return false;
> +
> +       /* Grab right to be ongoing flusher */
> +       if (!ongoing && cgroup_is_root(cgrp)) {
> +               struct cgroup *old;
> +
> +               old =3D cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> +               if (old) {
> +                       /* Lost race for being ongoing flusher */
> +                       if (cgroup_is_descendant(cgrp, old))
> +                               return false;
> +               }
> +               /* Due to lock yield combined with strict mode record ID =
*/

This needs a more detailed comment.

> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);

This will overwrite the ID if we lost the race but are not a
descendant of the flusher that won the race, right?
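
If so, one possible way to avoid it is to record the owner only when
the compare-and-exchange actually succeeds. A small userspace model of
that idea, using C11 atomics in place of cmpxchg()/WRITE_ONCE(); the
names (struct flush_target, claim_ongoing_flush(), the integer owner
id) are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cgroup being flushed. */
struct flush_target {
	int unused;
};

static _Atomic(struct flush_target *) ongoing_target;	/* NULL when idle */
static atomic_int ongoing_owner;			/* 0 means no owner */

/*
 * Try to become the ongoing flusher for t on behalf of task 'self'.
 * The owner id is written only by the winner of the exchange, so a
 * loser can never clobber the id recorded by the real owner.
 */
static bool claim_ongoing_flush(struct flush_target *t, int self)
{
	struct flush_target *expected = NULL;

	if (!atomic_compare_exchange_strong(&ongoing_target, &expected, t))
		return false;	/* lost the race: leave the owner id alone */

	atomic_store(&ongoing_owner, self);
	return true;
}
```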

> +       }
> +lock:
> +       __cgroup_rstat_lock(cgrp, -1);
> +
> +       return true;
> +}
> +
> +static void cgroup_rstat_unlock_flusher(struct cgroup *cgrp)
> +{
> +       if (cgrp =3D=3D READ_ONCE(cgrp_rstat_ongoing_flusher) &&
> +           READ_ONCE(cgrp_rstat_ongoing_flusher_ID) =3D=3D current) {
> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, NULL);
> +               WRITE_ONCE(cgrp_rstat_ongoing_flusher, NULL);
> +       }
> +
> +       __cgroup_rstat_unlock(cgrp, -1);
> +}
> +
>  /* see cgroup_rstat_flush() */
>  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>         __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
> @@ -333,6 +398,19 @@ static void cgroup_rstat_flush_locked(struct cgroup =
*cgrp)
>         }
>  }
>
> +static int __cgroup_rstat_flush(struct cgroup *cgrp, bool strict)
> +{
> +       might_sleep();
> +
> +       if (!cgroup_rstat_trylock_flusher(cgrp, strict))
> +               return false;
> +
> +       cgroup_rstat_flush_locked(cgrp);
> +       cgroup_rstat_unlock_flusher(cgrp);
> +
> +       return true;
> +}
> +
>  /**
>   * cgroup_rstat_flush - flush stats in @cgrp's subtree
>   * @cgrp: target cgroup
> @@ -348,11 +426,49 @@ static void cgroup_rstat_flush_locked(struct cgroup=
 *cgrp)
>   */
>  __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>  {
> -       might_sleep();
> +       __cgroup_rstat_flush(cgrp, true);
> +}
>
> -       __cgroup_rstat_lock(cgrp, -1);
> -       cgroup_rstat_flush_locked(cgrp);
> -       __cgroup_rstat_unlock(cgrp, -1);
> +int cgroup_rstat_flush_relaxed(struct cgroup *cgrp, bool wait_for_flush)
> +{
> +       bool flushed =3D __cgroup_rstat_flush(cgrp, false);
> +
> +       if (!flushed && wait_for_flush) {

Isn't the code below essentially open-coding completions in a less
efficient way? Anyway, as I mentioned above, I don't believe we really
need wait_for_flush for now. I was hoping we could make this work for
both in-kernel and userspace flushers, but I don't think we want to
add it just for userspace flushers.
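
For reference, this is the pattern I mean by completions, shown as a
userspace analogue with a pthread mutex and condition variable. The
kernel equivalents would be struct completion with
wait_for_completion() and complete_all(); the names below are made up
for the example, and this is only a sketch of the waiting scheme, not
a proposal for how to wire it into rstat:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace analogue of a kernel struct completion. */
struct flush_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void flush_completion_init(struct flush_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Called once by the flusher when its flush has finished. */
static void flush_complete_all(struct flush_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);	/* wake every waiter at once */
	pthread_mutex_unlock(&c->lock);
}

/*
 * Called by anyone who skipped the flush but needs up-to-date stats.
 * Waiters sleep until woken instead of repeatedly taking and dropping
 * the global lock to poll the owner id.
 */
static void flush_wait(struct flush_completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```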

> +               /*
> +                * Reaching here we know an ongoing flusher is running, t=
hat
> +                * will take care of flushing for us, but for caller to r=
ead
> +                * accurate stats we want to wait for this ongoing flushe=
r.
> +                *
> +                * TODO: When lock becomes mutex and no-yielding this cod=
e can
> +                * be simplifed as we can just sleep on the mutex lock.
> +                */
> +               struct task_struct *id, *cur_id;
> +               u64 timeout;
> +
> +               id =3D READ_ONCE(cgrp_rstat_ongoing_flusher_ID);
> +               timeout =3D jiffies_64 + msecs_to_jiffies(50);
> +
> +               if (!id)
> +                       return false;
> +
> +               cond_resched();
> +               /* We might get lucky and flush already completed */
> +               cur_id =3D READ_ONCE(cgrp_rstat_ongoing_flusher_ID);
> +
> +               /* Due to lock yield, make sure "id" flusher completes */
> +               while (cur_id =3D=3D id && time_before64(jiffies_64, time=
out)) {
> +                       cond_resched();
> +                       /* Use mutex to reduce stress on global lock */
> +                       mutex_lock(&cgrp_rstat_ongoing_flusher_serialize)=
;
> +                       __cgroup_rstat_lock(cgrp, -1);
> +                       /* Get lock with ongoing can happen due to yieldi=
ng */
> +                       cur_id =3D READ_ONCE(cgrp_rstat_ongoing_flusher_I=
D);
> +                       __cgroup_rstat_unlock(cgrp, -1);
> +                       mutex_unlock(&cgrp_rstat_ongoing_flusher_serializ=
e);
> +               }
> +       }
> +
> +       return flushed;
>  }
>
>  /**
> @@ -368,8 +484,11 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>         __acquires(&cgroup_rstat_lock)
>  {
>         might_sleep();
> -       __cgroup_rstat_lock(cgrp, -1);
> -       cgroup_rstat_flush_locked(cgrp);
> +
> +       if (cgroup_rstat_trylock_flusher(cgrp, false))
> +               cgroup_rstat_flush_locked(cgrp);
> +       else
> +               __cgroup_rstat_lock(cgrp, -1);
>  }
>
>  /**
> @@ -379,7 +498,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>  void cgroup_rstat_flush_release(struct cgroup *cgrp)
>         __releases(&cgroup_rstat_lock)
>  {
> -       __cgroup_rstat_unlock(cgrp, -1);
> +       cgroup_rstat_unlock_flusher(cgrp);
>  }
>
>  int cgroup_rstat_init(struct cgroup *cgrp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71fe2a95b8bd..6694f7a859b5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -871,12 +871,26 @@ static inline void memcg_rstat_updated(struct mem_c=
group *memcg, int val)
>         }
>  }
>
> -static void do_flush_stats(struct mem_cgroup *memcg)
> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait_for_flush=
)
>  {
> -       if (mem_cgroup_is_root(memcg))
> -               WRITE_ONCE(flush_last_time, jiffies_64);
> +       bool flushed =3D cgroup_rstat_flush_relaxed(memcg->css.cgroup,
> +                                                 wait_for_flush);
>
> -       cgroup_rstat_flush(memcg->css.cgroup);
> +       if (mem_cgroup_is_root(memcg) && flushed)
> +                   WRITE_ONCE(flush_last_time, jiffies_64);
> +}
> +
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg,
> +                                    bool wait_for_flush)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       if (!memcg)
> +               memcg =3D root_mem_cgroup;
> +
> +       if (memcg_vmstats_needs_flush(memcg->vmstats))
> +               do_flush_stats(memcg, wait_for_flush);
>  }
>
>  /*
> @@ -890,21 +904,19 @@ static void do_flush_stats(struct mem_cgroup *memcg=
)
>   */
>  void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -
> -       if (!memcg)
> -               memcg =3D root_mem_cgroup;
> +       __mem_cgroup_flush_stats(memcg, true);
> +}
>
> -       if (memcg_vmstats_needs_flush(memcg->vmstats))
> -               do_flush_stats(memcg);
> +void mem_cgroup_flush_stats_relaxed(struct mem_cgroup *memcg)
> +{
> +       __mem_cgroup_flush_stats(memcg, false);
>  }
>
>  void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
>  {
>         /* Only flush if the periodic flusher is one full cycle late */
>         if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH=
_TIME))
> -               mem_cgroup_flush_stats(memcg);
> +               mem_cgroup_flush_stats_relaxed(memcg);

mem_cgroup_flush_stats_ratelimited() now basically says: only flush if
the periodic flusher is late but no one else is flushing. If we end up
pursuing this, we probably want to start documenting these variants.