From: Ivan Babrou <ivan@cloudflare.com>
Date: Thu, 7 Sep 2023 18:02:50 -0700
Subject: Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
To: Wei Xu
Cc: Michal Hocko, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo,
	Michal Koutný, Waiman Long, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Greg Thelen
References: <20230831165611.2610118-1-yosryahmed@google.com> <20230831165611.2610118-5-yosryahmed@google.com>
On Thu, Sep 7, 2023 at 5:52 PM Wei Xu wrote:
>
> On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko wrote:
> >
> > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > Unified flushing allows for great concurrency for paths that attempt to
> > > flush the stats, at the expense of potential staleness and a single
> > > flusher paying the extra cost of flushing the full tree.
> > >
> > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > may be unexpected and problematic, especially when such stats are used
> > > for critical paths such as userspace OOM handling. Additionally, a
> > > userspace reader will occasionally pay the cost of flushing the entire
> > > hierarchy, which also causes problems in some cases [1].
> > >
> > > Opt userspace reads out of unified flushing. This makes the cost of
> > > reading the stats more predictable (proportional to the size of the
> > > subtree), as well as the freshness of the stats. Userspace readers are
> > > not expected to have concurrency similar to in-kernel flushers, so
> > > serializing them among themselves and among in-kernel flushers should be
> > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > userspace readers to make sure only a single userspace reader can compete
> > > with in-kernel flushers at a time. This takes away userspace's ability to
> > > directly influence or hurt in-kernel lock contention.
> >
> > I think it would be helpful to note that the primary reason this is a
> > concern is that the spinlock is dropped during flushing under
> > contention.
> >
> > > An alternative is to remove flushing from the stats reading path
> > > completely, and rely on the periodic flusher. This should be accompanied
> > > by making the periodic flushing period tunable, and providing an
> > > interface for userspace to force a flush, following a similar model to
> > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > implementation needs to be changed because:
> > > - The cost of reading stats will be very cheap and we won't be able to
> > >   take that back easily.
> > > - There are user-visible interfaces involved.
> > >
> > > Hence, let's go with the change that's most reversible first and revisit
> > > as needed.
> > >
> > > This was tested on a machine with 256 cpus by running a synthetic test
> > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > unified flusher). Concurrently, one thread is spawned per cgroup to read
> > > the stats every second (including root, top-level, and leaf cgroups --
> > > so 251 threads in total). No significant regressions were observed in the
> > > total run time, which means that userspace readers are not significantly
> > > affecting in-kernel flushers:
> > >
> > > Base (mm-unstable):
> > >
> > > real	0m22.500s
> > > user	0m9.399s
> > > sys	73m41.381s
> > >
> > > real	0m22.749s
> > > user	0m15.648s
> > > sys	73m13.113s
> > >
> > > real	0m22.466s
> > > user	0m10.000s
> > > sys	73m11.933s
> > >
> > > With this patch:
> > >
> > > real	0m23.092s
> > > user	0m10.110s
> > > sys	75m42.774s
> > >
> > > real	0m22.277s
> > > user	0m10.443s
> > > sys	72m7.182s
> > >
> > > real	0m24.127s
> > > user	0m12.617s
> > > sys	78m52.765s
> > >
> > > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > >
> > > Signed-off-by: Yosry Ahmed
> >
> > OK, I can live with that but I still believe that locking involved in
> > the user interface only begs for issues later on as there is no control
> > over that lock contention other than the number of processes involved.
> > As it seems that we cannot reach a consensus on this concern now, and
> > this should already be helping existing workloads, let's just buy some
> > more time ;)
>
> Indeed, even though the new global mutex protects the kernel from
> userspace contention on the rstats spinlock, its current
> implementation doesn't have any protection against lock contention
> among the userspace threads themselves, which can cause significant
> delays to memcg stats reads.
>
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds. Most memory.stat reads take 5ms-10ms in the kernel, with
> ~5% of reads even exceeding 1 second. This is a significant regression.
> In comparison, without contention, each memory.stat read takes only
> 20us-50us in the kernel. Almost all of the extra read time is spent
> waiting for the new mutex; the rstats flush itself accounts for only
> 10us-50us (this test creates only 1K memory cgroups and doesn't
> generate any load other than these stat reader threads).
>
> Here are some ideas to control the lock contention on the mutex and
> reduce both the median and tail latencies of concurrent memcg stat
> reads:
>
> - Bring back the stats_flush_threshold check from
>   mem_cgroup_try_flush_stats() into mem_cgroup_user_flush_stats(). This
>   check provides a reasonable bound on stats staleness while filtering
>   out a large number of rstats flush requests, which reduces the
>   contention on the mutex.
>
> - When contended, upgrade the per-memcg rstats flush in
>   mem_cgroup_user_flush_stats() to a root memcg flush and coalesce the
>   contended flushes together. We can wait for the ongoing flush to
>   complete and eliminate repeated flush requests.
Full root memcg flush being slow is one of the issues that prompted this patch:

* https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

I don't want us to regress in this regard.

> - Wait for the mutex and the ongoing flush with a timeout. We should
>   not busy-wait, though. We can bail out and read the stats without a
>   flush after waiting long enough. A long-stalled system call is much
>   worse than somewhat stale stats in these corner cases, in my opinion.
>
> Wei Xu
>
> > Acked-by: Michal Hocko
> >
> > Thanks!
> >
> > > ---
> > >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> > >  1 file changed, 20 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 94d5a6751a9e..46a7abf71c73 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> > >  static void flush_memcg_stats_dwork(struct work_struct *w);
> > >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > > +static DEFINE_MUTEX(stats_user_flush_mutex);
> > >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> > >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> > >  static u64 flush_next_time;
> > > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> > >  	cgroup_rstat_flush(memcg->css.cgroup);
> > >  }
> > >
> > > +/*
> > > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > > + * @memcg: memory cgroup to flush
> > > + *
> > > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > > + * readers hogging the lock.
> >
> > readers hogging the lock as do_stats_flush drops the spinlock under
> > contention.
> >
> > > + */
> > > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > > +{
> > > +	mutex_lock(&stats_user_flush_mutex);
> > > +	do_stats_flush(memcg);
> > > +	mutex_unlock(&stats_user_flush_mutex);
> > > +}
> > > +
> > >  /*
> > >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > >   *
> > --
> > Michal Hocko
> > SUSE Labs
> >