From: Yosry Ahmed
Date: Mon, 14 Aug 2023 17:28:22 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Tejun Heo
Cc: Michal Hocko, Shakeel Butt, Johannes Weiner, Roman Gushchin,
 Andrew Morton, Muchun Song, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Mon, Aug 14, 2023 at 5:14 PM Tejun Heo wrote:
>
> Hello,
>
> On Sat, Aug 12, 2023 at 04:04:32AM -0700, Yosry Ahmed wrote:
> > Taking a step back though, and considering there have been other
> > complaints about unified flushing causing expensive reads from
> > memory.stat [1], I am wondering if we should tackle the fundamental
> > problem.
> >
> > We have a single global rstat lock for flushing, which protects the
> > global per-cgroup counters as far as I understand. A single lock means
> > a lot of contention, which is why we implemented unified flushing on
> > the memcg side in the first place, where we let only one flusher
> > operate while everyone else skips, but that flusher needs to flush the
> > entire tree.
> >
> > This can be unnecessarily expensive (see [1]), and to avoid that
> > expense we sacrifice accuracy (which is what this patch is about). I am
> > exploring breaking down that lock into per-cgroup locks, where a
> > flusher acquires locks in a top-down fashion. This allows for some
> > concurrency in flushing, and makes unified flushing unnecessary. If we
> > retire unified flushing, we fix both accuracy and expensive reads at
> > the same time, without sacrificing performance for concurrent
> > in-kernel flushers.
> >
> > What do you think? I am prototyping something now and running some
> > tests; it seems promising and simple-ish (unless I am missing a big
> > correctness issue).
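(A minimal sketch of the per-cgroup, top-down locking idea quoted above.
All names here, struct flush_node, flush_one(), propagate_to_parent(),
are illustrative and not the actual rstat API:)

        #include <linux/spinlock.h>

        /*
         * Sketch only: each cgroup carries its own flush lock, and a
         * flusher locks top down, so flushers of disjoint subtrees
         * can run concurrently instead of contending on one global
         * lock.
         */
        struct flush_node {
                spinlock_t lock;        /* protects this cgroup's counters */
                struct flush_node *parent;
        };

        static void flush_one(struct flush_node *fn)
        {
                /*
                 * Lock parent before child, in a fixed top-down
                 * order, so two flushers can never deadlock.
                 */
                if (fn->parent)
                        spin_lock(&fn->parent->lock);
                spin_lock(&fn->lock);

                /* Fold this level's per-cpu deltas into the parent
                 * (hypothetical helper). */
                propagate_to_parent(fn);

                spin_unlock(&fn->lock);
                if (fn->parent)
                        spin_unlock(&fn->parent->lock);
        }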
> So, the original design used a mutex to synchronize flushing, with the
> idea being that updates are high freq but reads are low freq and can be
> relatively slow. Using rstat for mm internal operations changed this
> assumption quite a bit, and we ended up swapping that mutex for a
> spinlock.

Naive question: do mutexes handle thundering herd problems better than
spinlocks? I would assume so, but I am not sure.

> Here are some suggestions:
>
> * Update-side, per-cpu lock should be fine. I don't think splitting them
>   would buy us anything meaningful.

I agree, I was mainly concerned with the global flushing lock.

> * Flush-side, maybe we can break flushing into per-cpu or whatnot, but
>   there's no avoiding the fact that flushing can take quite a while if
>   there is a lot to flush, whether locks are split or not. I wonder
>   whether it'd be possible to go back to a mutex for flushing and update
>   the users to either consume the cached values or operate in a
>   sleepable context if a synchronous read is necessary, which is the
>   right thing to do anyway given how long flushes can take.

Unfortunately, flushing cannot be broken down into per-cpu work, as all
flushers update the same per-cgroup counters, so we need a bigger
locking scope. Switching the counters to atomics really hurts
performance.

Breaking down the lock to be per-cgroup is doable, but since we need to
lock both the parent and the cgroup, flushing top-level cgroups (which I
assume is most common) will lock the root anyway.

All flushers right now operate in a sleepable context, so we can go back
to the mutex if you think this will make things better. The slowness
problem reported recently is in a sleepable context; it's just too slow
for userspace, if I understand correctly.

+Ivan Babrou

What I am thinking about now is that since all flushers are sleepable,
perhaps the thundering herd problem would be less severe, as we may
release the lock (or mutex) at CPU boundaries.

I wonder if it would be better to retire unified flushing on the memcg
side, so that not all flushers need to flush the entire tree, and we
allow concurrent flushing. This should address the slowness in reads (as
demonstrated by a patch by Ivan [1]), and it should also address the
inaccuracy this thread is about, since no flushers will skip if someone
else is flushing.
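(A minimal sketch of what retiring unified flushing could look like on
the read side, assuming the existing cgroup_rstat_flush() entry point;
the wrapper's name is hypothetical:)

        #include <linux/cgroup.h>
        #include <linux/memcontrol.h>

        /*
         * Sketch only: instead of one flusher flushing the whole tree
         * while everyone else skips (and possibly reads stale
         * counters), every reader flushes just the subtree it is
         * reading.
         */
        static void memcg_flush_stats_subtree(struct mem_cgroup *memcg)
        {
                /*
                 * No trylock-and-skip: the call returns only once this
                 * subtree's per-cpu deltas have been folded in, so the
                 * reader sees accurate counters.
                 */
                cgroup_rstat_flush(memcg->css.cgroup);
        }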
I am trying to test for regressions by running some synthetic stress
testing (reclaim, refault, read stats, repeat); so far I can't see any.

Two things that we will need to figure out if we retire unified
flushing:

(a) We now have a global "stats_flush_threshold" variable to know when
to flush and when to skip. If flushing is not global, we need to make
this per-cgroup or retire it as well. If we make it per-cgroup it may
affect the update side, and we would need to move it into the rstat
code, I think (see the sketch at the end of this message).

(b) We now have a global "flush_next_time" to know whether the
ratelimited flusher should run or not. If we keep it, only the global
async flusher will kick it forward; sync flushers will not. Otherwise we
can also make it per-cgroup and update it during flushes.

[1] https://github.com/bobrik/linux/commit/50b627811d54

> Thanks.
>
> --
> tejun
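(A minimal sketch of point (a) above, moving the update threshold into
per-cgroup state; all names are illustrative. The update-side cost the
email mentions shows up here as the upward propagation on every batch
of updates:)

        #include <linux/atomic.h>
        #include <linux/cpumask.h>

        /*
         * Sketch only: track pending updates per cgroup instead of in
         * one global counter, so each subtree can decide independently
         * whether a flush is worthwhile.
         */
        struct memcg_flush_state {
                atomic_t pending_updates;       /* updates since last flush */
                struct memcg_flush_state *parent;
        };

        static void note_updates(struct memcg_flush_state *ms, int nr)
        {
                /*
                 * Update side: propagate the count up the tree so
                 * ancestors know a flush of their subtree would find
                 * something. This is the extra update-side cost
                 * mentioned in (a).
                 */
                for (; ms; ms = ms->parent)
                        atomic_add(nr, &ms->pending_updates);
        }

        static bool flush_worthwhile(struct memcg_flush_state *ms)
        {
                /* Read side: skip the flush if little has changed. */
                return atomic_read(&ms->pending_updates) > num_online_cpus();
        }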