From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E6E91C3DA6F
	for <linux-mm@archiver.kernel.org>; Fri, 25 Aug 2023 15:15:36 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 675232800B4; Fri, 25 Aug 2023 11:15:36 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 625372800A2; Fri, 25 Aug 2023 11:15:36 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 4A0E62800B4; Fri, 25 Aug 2023 11:15:36 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 37C992800A2
	for <linux-mm@kvack.org>; Fri, 25 Aug 2023 11:15:36 -0400 (EDT)
Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id 0998D1606BC
	for <linux-mm@kvack.org>; Fri, 25 Aug 2023 15:15:36 +0000 (UTC)
X-FDA: 81162976272.24.7E5FB6F
Received: from mail-ed1-f50.google.com (mail-ed1-f50.google.com [209.85.208.50])
	by imf17.hostedemail.com (Postfix) with ESMTP id 1D33C4001E
	for <linux-mm@kvack.org>; Fri, 25 Aug 2023 15:15:32 +0000 (UTC)
Authentication-Results: imf17.hostedemail.com;
	dkim=pass header.d=google.com header.s=20221208 header.b="v9rgKhW/";
	spf=pass (imf17.hostedemail.com: domain of yosryahmed@google.com designates 209.85.208.50 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1692976533;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=1gxXEdOB/IMxquDgU+zJqv+Cmd6hlTz7uMEp9VHRpC8=;
	b=jsIPX9DXJDwjoEsh2EYUcis4LiVEs6J29mVyhTBSaOKg29WHTdijQ8xMAyBme32deAJjM8
	pcLTFEB/RhAjgycZYhl+7rS1I4q8z+aoE/g9lwCLq7ky8cS+QadtF8u5ItSwK7Nbca+/rV
	pVtcAtjnqMu7SbXzzBKJzv10hp1nh8c=
ARC-Authentication-Results: i=1;
	imf17.hostedemail.com;
	dkim=pass header.d=google.com header.s=20221208 header.b="v9rgKhW/";
	spf=pass (imf17.hostedemail.com: domain of yosryahmed@google.com designates 209.85.208.50 as permitted sender) smtp.mailfrom=yosryahmed@google.com;
	dmarc=pass (policy=reject) header.from=google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1692976533; a=rsa-sha256;
	cv=none;
	b=D85bcTvViwBml4Ud/+LZzMAMGw+BV2YuyIduC3Ks3LLHJvfk+hQX5eczFh4FpTqP9u94ot
	5TgGkWWve5r7DdzrVH3SPRS3dFvyUEX8myeu5gfOwM/+4hO1Ul1MrRQDrzJgW12dskxOAb
	8k8VKcSojwJsxXe6/iu60YIL78UbNWw=
Received: by mail-ed1-f50.google.com with SMTP id 4fb4d7f45d1cf-5280ef23593so1521204a12.3
        for <linux-mm@kvack.org>; Fri, 25 Aug 2023 08:15:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20221208; t=1692976531; x=1693581331;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=1gxXEdOB/IMxquDgU+zJqv+Cmd6hlTz7uMEp9VHRpC8=;
        b=v9rgKhW/wrrxHN6zPsoKuRmq5bzWNw9F9gEpCbdHdIuVFZlCB7ewA+sn+qO7hL3cTg
         LNH7hYsZEKvUEpseVg039E388FbeED8aIIM+Au0eFzEYzP5XVlXrHFYCUCjfoz9oueI/
         xgkOWV0p6orpu71DifrGZz99bkW97FVu7b3aBaihLniRMfjTiapQLVbATAAopJcxWP6f
         d9mwChzXv+2ibGS7UcCyzLnXROdHM7suMdH1BNp6n6X+fEE1AwHI5xHBQd/h22o59WOH
         M6kDtt+8Lf3Uet+U/+ZYEiE+ts45hUladY3/usx8lWQNakaZ8qw8quFyksADFgpX9/m6
         9cxg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1692976531; x=1693581331;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=1gxXEdOB/IMxquDgU+zJqv+Cmd6hlTz7uMEp9VHRpC8=;
        b=VwY+yEwh6OQ9HVzTybMZXW15T1s+zy/j9/VkuT2jiu1VZd56KmfM0ZY4pN3qOqQ2+G
         PnPKpMtvs9WQsKACpv/e0BfPXOOm4jgRCWn228c8l66NWwSzSlqT8DZZFdQ3aFlnT5Ao
         LyG1WkwhJJhBIAST/1aIOHdsxBKwqwre78GLSL7IvayUc/Qo45cFbIKxwDvKKiloZRW9
         szTDnnZBV91fj6RmBzdcPGZMhPAzQJjGorVj1yXwGvqVu6spVCvZUT7aUqDnrIJkG59J
         5OB8CscDAQ8csZSjk6LizAoE5mikMD9AKs26uhJLTUUrvkeaPXC8hhVpmBG+4B6wGosD
         +KLw==
X-Gm-Message-State: AOJu0Ywfjhu7diquDcPbFxQJzCMQKUg4fys0MvSZ2rq+KrlC+2yvhO1L
	f7XAuuXHFiusggfmqPMUYbJ+xlvrw7y9MwlRqz1wtg==
X-Google-Smtp-Source: AGHT+IGJg7QkdQsClygNT9r5bEfow7rITYDHt5mrg27t0ux9YcRLbh1cZlAUmEqzdcKgGK7nq2+OP++BamPJYW0zqKQ=
X-Received: by 2002:a17:907:75c5:b0:9a2:1e14:86bd with SMTP id
 jl5-20020a17090775c500b009a21e1486bdmr3622301ejc.65.1692976531308; Fri, 25
 Aug 2023 08:15:31 -0700 (PDT)
MIME-Version: 1.0
References: <20230821205458.1764662-1-yosryahmed@google.com>
 <20230821205458.1764662-4-yosryahmed@google.com> <ZOR6eyYfJYlxdMet@dhcp22.suse.cz>
 <CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com>
 <ZOW2PZN8Sgqq6uR2@dhcp22.suse.cz> <CAJD7tka34WjtwBWfkTu8ZCEUkLm7h-AyCXpw=h34n4RZ5qBVwA@mail.gmail.com>
 <ZOcDLD/1WaOwWis9@dhcp22.suse.cz> <CAJD7tkZby2enWa8_Js8joHqFx_tHB=aRqHOizaSiXMUjvEei4g@mail.gmail.com>
 <CAJD7tkadEtjK_NFwRe8yhUh_Mdx9LCLmCuj5Ty-pqp1rHTb-DA@mail.gmail.com> <ZOhSyvDxAyYUJ45i@dhcp22.suse.cz>
In-Reply-To: <ZOhSyvDxAyYUJ45i@dhcp22.suse.cz>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 25 Aug 2023 08:14:54 -0700
Message-ID: <CAJD7tkYPyb+2zOKqctQw-vhuwYRg85e6v2Y44xWJofHZ+F+YQw@mail.gmail.com>
Subject: Re: [PATCH 3/3] mm: memcg: use non-unified stats flushing for
 userspace reads
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, 
	Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, 
	Muchun Song <muchun.song@linux.dev>, Ivan Babrou <ivan@cloudflare.com>, Tejun Heo <tj@kernel.org>, 
	linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Rspamd-Queue-Id: 1D33C4001E
X-Rspam-User: 
X-Stat-Signature: j994i8tx4kmzgqgi5iym46fk1om1a478
X-Rspamd-Server: rspam01
X-HE-Tag: 1692976532-24783
X-HE-Meta: U2FsdGVkX19zoixsrp5lReoMf6iqu/zHYwDnEGoC1b2PLp8KikdBqYaY1KLtCaUNmCVGg44mvrChIHdX1DiKpjG9ICJ2iEafqUGEsY9lNIdtzs8Qm941TzLvvKjpbMiX2KLo7bG9N5+Gq1cfbrFGRDxDey9As24gC41njt3Cci6T+5a0s+o+QIUBK4ytUxIQBncXDibcJb9k80yqjMe2E7c2kUiXw+X5nI3qW05iW9rpu9Mh7rBJLxQfTLifY4i8SBAvdz6eQlfuiPnACpKhwHq8HLKjHynNR+MRNZn2BHfSUX/zHPSUapZubQEKk1IYwLHO2EfT++3lV0pdJ8uOBd9vT/wONF2dpBufXQtBoFz0kgBkAIKeDrfJ4PLRSz6ZWDrDUoGLmFeGQcvP81xgLXUZSw1LhhyjMxQXtAxz3UrJxMswJvcoJ6Jg/y/H7iVCHAFOVPQz2hC7Z/M80VFxTY4P59Qx8KFgRMKDkhoisqvrAkwlN9lfbRIYF9Mauh5iOz1GjtuoEBBg9DRY+dOSq6s45aqvkxMLQbf5zk5xbCPyXySKd4ni1cg+43R1QMmo0sIj5PIDZwMnwABD+IngeU+98tRbRzvsi2Tx/rITQHGSw3TfgkY8J6QNT34jXriIIKJ/iR5HfIV+g3BsAFEOrwxi2YYo+lf5JoGigpZ1Mr2tvga8FPvWPPK6PJjo4tyyP3ieVXRd7u/+BZ8xErFMSNS6jUHyrdhG7ZzUp5vRDQAAZ0HKwx8dJ7lvH20TSN/eTRhqBX7knSKGMcf2UyPIItvAB3MjUNUn6xQuBemykzjDZNHT6JvmPjDcxHemQ69EXBp4cGbeXe4dZmgHGE2IW0RL0PdRGv2KNG7w06693df+QAlNvWg2SmZsUTsWrdOK2lQFJdTTBVWfBZCK972XD31kj2Iy2WaJe+7DOCSym1fVxrAd1z8WbBcMRRkHIAs4yodFCqD0csFDbaazDh0
 YnbpXPzJ
 48nwwFt7CM+1bESzikjzc134cvGQTdEsLhGIVBpM3hwby0ZngGlW21wq79DrSSL1rKDmqboVU5M/kLrIgJcmBkz/lbHA6iNFC289+gWy2qWSF3LlcI4iOUGOQo5x99zNyDDXZaqIyVIPE7t9XQo2hRrkpuwQ72xadG/ept4deMLgMsW2mFrP/gSlojIEo5t4PSPLe3PZ0ek6g7ADZ4jNZ2VTaEePtVJIyGVCsoByjM6p+QMsuYLTwFZsTnUf8pBH/zxrdPw2Ux7E0Lf9Uv9GMmqZW0klXPdl8xP+3JJakBN6Hjo+0jAQ21Rbmjg==
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Aug 25, 2023 at 12:05=E2=80=AFAM Michal Hocko <mhocko@suse.com> wro=
te:
>
> On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> > On Thu, Aug 24, 2023 at 11:15=E2=80=AFAM Yosry Ahmed <yosryahmed@google=
.com> wrote:
> > >
> > > On Thu, Aug 24, 2023 at 12:13=E2=80=AFAM Michal Hocko <mhocko@suse.co=
m> wrote:
> > > >
> > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > > On Wed, Aug 23, 2023 at 12:33=E2=80=AFAM Michal Hocko <mhocko@sus=
e.com> wrote:
> > > > > >
> > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > > On Tue, Aug 22, 2023 at 2:06=E2=80=AFAM Michal Hocko <mhocko@=
suse.com> wrote:
> > > > > > > >
> > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > > So to answer your question, I don't think a random user can r=
eally
> > > > > > > affect the system in a significant way by constantly flushing=
. In
> > > > > > > fact, in the test script (which I am now attaching, in case y=
ou're
> > > > > > > interested), there are hundreds of threads that are reading s=
tats of
> > > > > > > different cgroups every 1s, and I don't see any negative effe=
cts on
> > > > > > > in-kernel flushers in this case (reclaimers).
> > > > > >
> > > > > > I suspect you have missed my point.
> > > > >
> > > > > I suspect you are right :)
> > > > >
> > > > >
> > > > > > Maybe I am just misunderstanding
> > > > > > the code but it seems to me that the lock dropping inside
> > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number =
of
> > > > > > contenders which is really dangerous when it is triggerable fro=
m the
> > > > > > userspace. The number of spinners at a moment is always bound b=
y the
> > > > > > number CPUs but depending on timing many potential spinners mig=
ht be
> > > > > > cond_rescheded and the worst time latency to complete can be re=
ally
> > > > > > high. Makes more sense?
> > > > >
> > > > > I think I understand better now. So basically because we might dr=
op
> > > > > the lock and resched, there can be nr_cpus spinners + other spinn=
ers
> > > > > that are currently scheduled away, so these will need to wait to =
be
> > > > > scheduled and then start spinning on the lock. This may happen fo=
r one
> > > > > reader multiple times during its read, which is what can cause a =
high
> > > > > worst case latency.
> > > > >
> > > > > I hope I understood you correctly this time. Did I?
> > > >
> > > > Yes. I would just add that this could also influence the worst case
> > > > latency for a different reader - so an adversary user can stall oth=
ers.
> > >
> > > I can add that for v2 to the commit log, thanks.
> > >
> > > > Exposing a shared global lock in uncontrolable way over generally
> > > > available user interface is not really a great idea IMHO.
> > >
> > > I think that's how it was always meant to be when it was designed. Th=
e
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at som=
e
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's stil=
l
> > > there.
>
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.

Perhaps that also means it's not a problem in practice, or at least I hope =
so :)

>
> > > I was hoping Tejun would confirm/deny this.
>
> Yes, that would be interesting to hear.
>
> > One thing we can do to remedy this situation is to replace the global
> > rstat lock with a mutex, and drop the resched/lock dropping condition.
> > Tejun suggested this in the previous thread. This effectively reverts
> > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> > since now all the flushing contexts are sleepable.
>
> I would have a very daring question. Do we really need a global lock in
> the first place? AFAIU this locks serializes (kinda as the lock can be
> dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a
> single one ATM). I can see cgroup_base_stat_cputime_show would like to
> have a consistent view on multiple stats but can we live without a
> strong guarantee or to replace the lock with seqlock instead?

Unfortunately, it's more difficult than this. I thought about breaking
down that lock and falled back to this solution. See below.

>
> > My synthetic stress test does not show any regressions with mutexes,
> > and there is a small boost to reading latency (probably because we
> > stop dropping the lock / rescheduling). Not sure if we may start
> > seeing need_resched warnings on big flushes though.
>
> Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock=
")
> it seems the point of moving away from mutex was to have a more usable
> API.

Right, we needed an atomic interface for flushing, but that later
turned out to cause some problems, so we reworked the code such that
we never have to flush atomically. Now we can go back to the mutex if
it makes things better, I am not really sure how much it helps though.

>
> > One other concern that Shakeel pointed out to me is preemption. If
> > someone holding the mutex gets preempted this may starve other
> > waiters. We can disable preemption while we hold the mutex, not sure
> > if that's a common pattern though.
>
> No, not really. It is expected that holder of mutex can sleep and can be
> preempted as well.

Maybe not for this specific case because it's a global mutex and
holding it for too long might cause problems? Is it bad to disable
preemption while holding a mutex?

>
> I might be wrong but the whole discussion so far suggests that the
> global rstat lock should be reconsidered. From my personal experience
> global locks easily triggerable from the userspace are just a receip for
> problems. Stats reading shouldn't be interfering with the system runtime
> as much as possible and they should be deterministic wrt runtime as
> well.

The problem is that the global lock also serializes the global
counters that we flush to. I will talk from the memcg flushing
perspective as that's what I am familiar with. I am not sure how much
this is transferable to other flushers.

On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
synchronizes access to multiple counters, for this discussion what's
most important are:
- The global stat counters of the memcg being flushed (e.g.
memcg->vmstats->state).
- The pending stat counters of the parent being flushed (e.g.
parent->vmstats->state_pending).

To get rid of this lock, we either need to use atomics for those
counters, or have fine-grained locks. I experimented a while back with
atomic and flushing was significantly more expensive. The number of
atomic operations would scale with O(# cgroups * # cpus) and can grow
unbounded.

The other option is fine-grained locks. In this case we would need to
lock both the memcg being flushed and its parent together. This can go
wrong with ordering, and for top-level memcgs the root memcg lock will
become the new global lock anyway. One way to overcome this is to
change the parent's pending stat counters to also be percpu. This will
increase the memory usage of the stat counters per-memcg, by hundreds
of bytes per cpu.

Let's assume that's okay, so we only need to lock one cgroup at a
time. There are more problems.

We also have a percpu lock (cgroup_rstat_cpu_lock), which we use to
lock the percpu tree which has the cgroups that have updates on this
cpu. It is held by both flushing contexts and updating contexts (hot
paths). Ideally we don't want to spin on a per-cgroup (non percpu)
lock while holding the percpu lock, as flushers of different cpus will
start blocking one another, as well as blocking updaters. On the other
hand, we need to hold percpu lock to pop a cgroup from that tree and
lock it. It's a chicken and egg problem. Also, if we release the
percpu lock while flushing, we open another can of worms:
(a) Concurrent updates can keep updating the tree putting us in an
endless flushing loop. We need some sort of generation tracking for
this.
(b) Concurrent flushing can flush a parent prematurely on the same cpu
as we are flushing a child, and not get the updates from the child.

One possible scheme to handle the above is as follows:
1. Hold the percpu lock, find the cgroup that needs to be flushed next.
2. Trylock that cgroup. If we succeed, we flush it with both the
percpu and the per-cgroup locks held.
3. If we fail, release the percpu lock and spin on the per-cgroup lock.
4. When we get the per-cgroup lock, take the percpu lock again, and
make sure that the locked cgroup is still the correct cgroup to flush.
If not, repeat.
5. Flush the cgroup, and go back to step 1 to get the next cgroup.

Of course this is complex and error-prone, and might introduce
significant overheads due to the number of locks we need to take
compared with what we currently have.

I guess what I am trying to say is, breaking down that lock is a major
surgery that might require re-designing or re-implementing some parts
of rstat. I would be extremely happy to be proven wrong. If we can
break down that lock then there is no need for unified flushing even
for in-kernel contexts, and we can all live happily ever after with
cheap(ish) and accurate stats flushing.

I really hope we can move forward with the problems at hand (sometimes
reads are expensive, sometimes reads are stale), and not block fixing
them until we can come up with an alternative to that global lock
(unless, of course, there is a simpler way of doing that).

Sorry for the very long reply :)

Thanks!