From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 9 Aug 2023 05:31:04 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Andrew Morton,
 Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org

On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko wrote:
>
> On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > Over time, the memcg code added multiple optimizations to the stats
> > flushing path that introduce a tradeoff between accuracy and
> > performance. In some contexts (e.g.
> > dirty throttling, refaults, etc), a full rstat flush of the stats in
> > the tree can be too expensive. Such optimizations include [1]:
> > (a) Introducing a periodic background flusher to keep the size of the
> >     update tree from growing unbounded.
> > (b) Allowing only one thread to flush at a time, and other concurrent
> >     flushers just skip the flush. This avoids a thundering herd problem
> >     when multiple reclaim/refault threads attempt to flush the stats at
> >     once.
> > (c) Only executing a flush if the magnitude of the stats updates exceeds
> >     a certain threshold.
> >
> > These optimizations were necessary to make flushing feasible in
> > performance-critical paths, and they come at the cost of some accuracy
> > that we choose to live without. On the other hand, for flushes invoked
> > when userspace is reading the stats, the tradeoff is less appealing.
> > This code path is not performance-critical, and the inaccuracies can
> > affect userspace behavior. For example, skipping flushing when there is
> > another ongoing flush is essentially a coin flip. We don't know if the
> > ongoing flush is done with the subtree of interest or not.
>
> I am not convinced by this much TBH. What kind of precision do you
> really need and how far off is what we provide?
>
> A more expensive read of stats from userspace is quite easy to notice
> and is usually reported as a regression. So you should have a convincing
> argument that the extra time spent is really worth it. AFAIK there are
> many monitoring (top-like) tools which simply read those files regularly
> just to show numbers, and they certainly do not need a high level of
> precision.

We used to spend this time before commit fd25a9e0e23b ("memcg: unify
memcg stat flushing"), which generalized the "skip if ongoing flush"
behavior for all stat flushing. As far as I know, the problem was
contention on the flushing lock, which also affected critical paths
like refault.

The problem is that the current behavior is nondeterministic: if cpu A
tries to flush stats and cpu B is already doing that, cpu A will just
skip. At that point, the cgroup(s) that cpu A cares about may have been
fully flushed, partially flushed (in terms of cpus), or not flushed at
all. We have no idea. We just know that someone else is flushing
something. IOW, in some cases the flush request will be completely
ignored and userspace will read stale stats (up to 2s + the periodic
flusher runtime).

Some workloads need to read up-to-date stats as feedback to actions
(e.g. after proactive reclaim, or for userspace OOM killing purposes),
and reading such stale stats causes regressions or misbehavior by
userspace.
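
To make that feedback loop concrete, here is a rough sketch of the kind
of userspace sequence I have in mind (illustration only: the cgroup path
and the reclaim amount are made-up examples, and error handling is
trimmed). A controller triggers proactive reclaim via memory.reclaim and
then immediately reads memory.stat to judge the effect:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Illustration only, not part of the patch. */
	static void reclaim_then_check(const char *cgrp)
	{
		char path[256], buf[8192];
		ssize_t n;
		int fd;

		/* Ask the kernel to proactively reclaim ~1M from this cgroup. */
		snprintf(path, sizeof(path), "%s/memory.reclaim", cgrp);
		fd = open(path, O_WRONLY);
		write(fd, "1M", strlen("1M"));
		close(fd);

		/*
		 * Read memory.stat right away as feedback. If the read-side
		 * flush is skipped because another flusher happens to be
		 * running, these numbers can be up to 2s (plus the periodic
		 * flusher runtime) out of date.
		 */
		snprintf(path, sizeof(path), "%s/memory.stat", cgrp);
		fd = open(path, O_RDONLY);
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0)
			buf[n] = '\0';	/* parse the counters of interest here */
		close(fd);
	}

If the flush behind that memory.stat read was skipped, the controller
may conclude that reclaim did nothing (or too much) and over- or
under-correct on its next iteration.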

> > [...]
> > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  	}
> >  }
> >
> > -static void do_flush_stats(void)
> > +static void do_flush_stats(bool full)
> >  {
> > +	if (!atomic_read(&stats_flush_ongoing) &&
> > +	    !atomic_xchg(&stats_flush_ongoing, 1))
> > +		goto flush;
> > +
> >  	/*
> > -	 * We always flush the entire tree, so concurrent flushers can just
> > -	 * skip. This avoids a thundering herd problem on the rstat global lock
> > -	 * from memcg flushers (e.g. reclaim, refault, etc).
> > +	 * We always flush the entire tree, so concurrent flushers can choose to
> > +	 * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > +	 * flush to complete. This avoids a thundering herd problem on the rstat
> > +	 * global lock from memcg flushers (e.g. reclaim, refault, etc).
> >  	 */
> > -	if (atomic_read(&stats_flush_ongoing) ||
> > -	    atomic_xchg(&stats_flush_ongoing, 1))
> > -		return;
> > -
> > +	while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > +		if (!cond_resched())
> > +			cpu_relax();
>
> You are reinventing a mutex with a spinning waiter. Why don't you
> simply make stats_flush_ongoing a real mutex and use try_lock for the
> !full flush and a normal lock otherwise?

So that was actually a spinlock at one point, when we used to skip if
try_lock failed. We opted for an atomic because the lock was only ever
used in a try_lock fashion.

The problem here is that the atomic is used to ensure that only one
thread actually attempts to flush at a time (and others skip or wait),
to avoid a thundering herd problem on cgroup_rstat_lock. Here, what I
am trying to do is essentially equivalent to "wait until the lock is
available, but don't grab it".

If we make stats_flush_ongoing a mutex, I am afraid the thundering herd
problem will be reintroduced, this time on stats_flush_ongoing itself.

I am not sure if there is a cleaner way of doing this, but I am
certainly open to suggestions. I also don't like how the spinning loop
looks as of now.
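
For the record, this is roughly the shape I understand you are
suggesting (untested sketch; stats_flush_lock is a made-up name for the
mutex that would replace stats_flush_ongoing, and the flush body itself
is elided):

	/* Sketch of the suggested mutex alternative, not actual kernel code. */
	static DEFINE_MUTEX(stats_flush_lock);

	static void do_flush_stats(bool full)
	{
		if (full) {
			/*
			 * Accuracy-critical readers (userspace stat reads)
			 * sleep until they can flush themselves. This is the
			 * part I worry about: many concurrent readers would
			 * now pile up on this mutex instead of skipping.
			 */
			mutex_lock(&stats_flush_lock);
		} else if (!mutex_trylock(&stats_flush_lock)) {
			/* In-kernel flushers (reclaim, refault, ...) just skip. */
			return;
		}

		/* ... do the actual rstat flush under the lock ... */

		mutex_unlock(&stats_flush_lock);
	}

That shape serializes all the full flushers behind one another, each
taking a turn to flush, whereas what the loop above tries to approximate
is "wait for the ongoing flush to finish, but don't take the lock or
flush again ourselves".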