From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 13 Sep 2023 11:37:58 -0400
From: Johannes Weiner
To: Yosry Ahmed
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
	kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
Message-ID: <20230913153758.GB45543@cmpxchg.org>
References: <20230913073846.1528938-1-yosryahmed@google.com>
	<20230913073846.1528938-4-yosryahmed@google.com>
In-Reply-To: <20230913073846.1528938-4-yosryahmed@google.com>

On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> Stats flushing for
> memcg currently follows the following rules:
>
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is that all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
>
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes stats accuracy is compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.
>
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
>
> - Without contention, there is no need to flush the entire tree. In
>   this case, only flush the subtree of interest. This avoids the
>   latency of a full root flush if unnecessary.
> - With contention, fall back to a coalesced (aka unified) flush of the
>   entire hierarchy, a root flush. In this case, instead of returning
>   immediately if a root flush is ongoing, wait for it to finish
>   *without* attempting to acquire the lock or flush. This is done
>   using a completion. Compared to competing directly on the underlying
>   lock, this approach makes concurrent flushing a synchronization
>   point instead of a serialization point. Once a root flush finishes,
>   *all* waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to
>   the number of online cpus. This keeps the flush latency bounded at
>   the tail (very high concurrency). We fall back to sacrificing stats
>   freshness only in such cases in favor of performance.
>
> This was tested in two ways on a machine with 384 cpus:
>
> - A synthetic test with 5000 concurrent workers doing allocations and
>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>   No significant regressions were noticed in the total runtime.
>   Note that if concurrent flushers compete directly on the spinlock
>   instead of waiting for a completion, this test shows 2x-3x
>   slowdowns. Even though subsequent flushers would have nothing to
>   flush, just the serialization and lock contention is a major
>   problem. Using a completion for synchronization instead seems to
>   overcome this problem.
>
> - A synthetic stress test for concurrently reading memcg stats
>   provided by Wei Xu.
>   With 10k threads reading the stats every 100ms:
>   - 98.8% of reads take <100us.
>   - 1.09% of reads take 100us to 1ms.
>   - 0.11% of reads take 1ms to 10ms.
>   - Almost no reads take more than 10ms.
>   With 10k threads reading the stats every 10ms:
>   - 82.3% of reads take <100us.
>   - 4.2% of reads take 100us to 1ms.
>   - 4.7% of reads take 1ms to 10ms.
>   - 8.8% of reads take 10ms to 100ms.
>   - Almost no reads take more than 100ms.
>
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>
> [weixugc@google.com: suggested the fallback logic and bounding the
>  number of waiters]
>
> Signed-off-by: Yosry Ahmed

> 	/*
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
> 	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>  		return;
> +	}
>
> -	WRITE_ONCE(flush_last_time, jiffies_64);
> -
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	/* A coalesced root flush is in order. Are we the designated flusher? */
> +	spin_lock(&root_flusher_lock);

I can't say I'm crazy about this.

Let's say you have a fairly sprawling and active cgroup tree with lots
of updates in lots of groups in the system.

Add a periodic memory.stat reader to one of the subgroups, and that
reader will do very fast, localized aggregation.

Now add a periodic memory.stat reader to some other subgroup. They
might still both do very fast, localized aggregation. Or they might
collide; and now, despite having nothing in common and sharing no data
besides the rstat lock, they have to flush the entire massive tree.
The rate at which this happens is purely dependent on the timing of
what should be independent actors. This is really not great for the
p50 vs p99 gap.

I think this whole thing is getting away from us.

When we first switched to rstat, every reader would take the global
rstat lock and then flush its local subtree of interest. This was
changed in the following commit:

commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
Author: Shakeel Butt
Date:   Fri Nov 5 13:37:34 2021 -0700

    memcg: unify memcg stat flushing

    The memcg stats can be flushed in multiple context and potentially in
    parallel too. For example multiple parallel user space readers for
    memcg stats will contend on the rstat locks with each other. There is
    no need for that. We just need one flusher and everyone else can
    benefit.
    In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats") the kernel periodically flush the memcg stats from the root,
    so, the other flushers will potentially have much less work to do.

    Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Michal Koutný"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

The idea was that we can avoid lock contention if somebody is already
doing the flushing. However, you're now bringing global serialization.
Clearly that idea didn't work out. What would be an obstacle to go
back to the original way of doing it?

With one reader, this will work the same as in your proposal. With two
readers, just like in your proposal, flushing must be serialized
against the root level. But at least the two flushes only aggregate
the local data they actually care about - not the entire tree data
that doesn't even have readers! This is much better for lock
contention and performance.

One concern is the thresholding code. The cost of flushing on every
read is too high: even when there is no flush work, checking for it is
kind of expensive due to the locks and the for_each_possible_cpu().

Could we do something like the following?

	mem_cgroup_flush(memcg)
	{
		mutex_lock(&memcg_flush_mutex);
		if (atomic_read(&memcg->stat_delta) > THRESHOLD)
			cgroup_rstat_flush(memcg);
		mutex_unlock(&memcg_flush_mutex);
	}

	mem_cgroup_css_rstat_flush(css, cpu)
	{
		...
		/*
		 * Reset here instead of mem_cgroup_flush()
		 * so that each flushed subgroup is reset
		 * recursively. A recent parent flush will
		 * allow a child flush to skip.
		 */
		atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
	}

	memcg_rstat_updated(memcg, val)
	{
		atomic_add(&memcg->stat_delta, val);
	}