From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 4 Oct 2022 18:17:40 -0700
Subject: [RFC] memcg rstat flushing optimization
To: Tejun Heo, Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin, Michal Koutný
Cc: Andrew Morton, Linux-MM, Cgroups, Greg Thelen
Hey everyone,

Sorry for the long email :)

We have recently run into a hard lockup on a machine with hundreds of CPUs and thousands of memcgs during an rstat flush. There have also been some discussions during LPC between myself, Michal Koutný, and Shakeel about memcg rstat flushing optimization. This email is a follow-up on that, discussing possible ideas to optimize memcg rstat flushing.

Currently, mem_cgroup_flush_stats() is the main interface for flushing memcg stats. It has some internal optimizations that skip a flush when there haven't been significant updates. It always flushes the entire memcg hierarchy, and it always flushes via cgroup_rstat_flush_irqsafe(), which runs with interrupts disabled and does not sleep.
As you can imagine, with a sufficiently large number of memcgs and CPUs, a call to mem_cgroup_flush_stats() can be slow, or in an extreme case like the one we ran into, cause a hard lockup (despite the periodic flush that runs every 4 seconds).

(a) A first step might be to introduce a non-_irqsafe version of mem_cgroup_flush_stats(), and only call the _irqsafe version in places where we can't sleep. This excludes some contexts, like the stats-reading context and the periodic flushing context, from being able to trigger a lockup.

(b) We can also stop flushing the entire memcg hierarchy, in the hope that flushing then happens incrementally over subtrees. However, full-hierarchy flushing was introduced to reduce lock contention when multiple contexts try to flush memcg stats concurrently: only one of them flushes and all the others return immediately (at some cost in accuracy, since the others don't actually wait for the flush to complete). Flushing subtrees would re-introduce that lock contention. Maybe we can mitigate this in the rstat code by using hierarchical locks instead of a global lock, although I can imagine this quickly getting too complicated.

(c) One other thing we can do (similar to the recent blkcg patch series [1]) is keep track of which stats have actually been updated. We currently flush MEMCG_NR_STATS + MEMCG_NR_EVENTS (thanks to Shakeel) + nodes * NR_VM_NODE_STAT_ITEMS. I didn't do the exact calculation, but I suspect this easily goes over 100. The tracking could take the form of a per-cpu bitmask. It would add some overhead on both the update and flush sides, but it could let us skip a lot of up-to-date stats and the cache misses that come with touching them. On a few sample machines I found that each (memcg, cpu) pair had fewer than 5 updated stats on average.

(d) Instead of optimizing rstat flushing in general, we can just mitigate the cases that can actually cause a lockup.
After we do (a) and separate out the call sites that actually need interrupts disabled, we can introduce a new selective-flush callback (e.g. cgroup_rstat_flush_opts()). This callback would flush only the stats we care about (a bitmask?) and leave the rstat tree untouched (traverse the tree, but don't pop the nodes). It would be less than optimal in cases where the stats we choose to flush happen to be the only ones that were updated, since the cgroup then remains on the rstat tree for no reason. However, it effectively addresses the lockup cases by flushing only a small subset of the stats.

(e) If we do both (c) and (d), we can go one step further. We can make cgroup_rstat_flush_opts() return a boolean indicating whether the cgroup is now completely flushed (i.e. what we asked to flush covers everything that was updated). If true, we can remove the cgroup from the rstat tree. To do this, though, we would need either separate rstat trees per subsystem, or per-cgroup tracking of which subsystems have updates (so that when cgroup_rstat_flush_opts() returns true, we know whether the cgroup can come off the tree).

Of course, nothing is free. Most of the ideas above introduce overhead somewhere, add complexity, or both. We also don't have a de facto benchmark that can tell us for sure whether a change makes things generally better, since the results will vary widely with the setup, the workloads, etc. Nothing will make everything better for all use cases. This is just me kicking off a discussion to see what we can/should do :)

[1] https://lore.kernel.org/lkml/20221004151748.293388-1-longman@redhat.com/