From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 744B2C4167B
	for <linux-mm@archiver.kernel.org>; Sat,  2 Dec 2023 01:57:32 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 9866C8D0080; Fri,  1 Dec 2023 20:57:31 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 9373F8D007C; Fri,  1 Dec 2023 20:57:31 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 7D7368D0080; Fri,  1 Dec 2023 20:57:31 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 66B3E8D007C
	for <linux-mm@kvack.org>; Fri,  1 Dec 2023 20:57:31 -0500 (EST)
Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id 0F320403EA
	for <linux-mm@kvack.org>; Sat,  2 Dec 2023 01:57:31 +0000 (UTC)
X-FDA: 81520216302.08.F32363D
Received: from mail-pj1-f41.google.com (mail-pj1-f41.google.com [209.85.216.41])
	by imf24.hostedemail.com (Postfix) with ESMTP id 2A9D1180015
	for <linux-mm@kvack.org>; Sat,  2 Dec 2023 01:57:28 +0000 (UTC)
Authentication-Results: imf24.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="MzKZ2C/f";
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf24.hostedemail.com: domain of bagasdotme@gmail.com designates 209.85.216.41 as permitted sender) smtp.mailfrom=bagasdotme@gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1701482249; a=rsa-sha256;
	cv=none;
	b=0LJjDy5FeqLvBn9vAprDYq4Xu0PZ+RYxLiSC/YhApSPhsUdV4u4XdC4ANrWaf2C38hMcei
	q9jf+S46OBzbMhNkCoiZlHRaO4P6uxvZx6/oudN6GatnKfQkRjp7azCMA3mqd/u6/tKVGC
	5QYMhJGchhSwxk6M7viDrk5RPDg8RH8=
ARC-Authentication-Results: i=1;
	imf24.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="MzKZ2C/f";
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf24.hostedemail.com: domain of bagasdotme@gmail.com designates 209.85.216.41 as permitted sender) smtp.mailfrom=bagasdotme@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1701482249;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=xrB9T/B1plr1KLo1ojkYN0bScrkWBvXt8AxTXe/RINw=;
	b=UO0WynfcY8jfrcoK4DjJz1Lp4RTdx2ha980AvAJmymgjIwEariX9yacXcidpGidzjxTzSz
	xrnlqtis6uHHhvVhS1H8KcebYMvJWKZF02pyfkzrN3GiQm2SHjPLQYNtPeaIp9S6g8tnnw
	SV5V7JVVCrYXSm78WerR3W+NBvTuX0I=
Received: by mail-pj1-f41.google.com with SMTP id 98e67ed59e1d1-28654179ec0so1442957a91.2
        for <linux-mm@kvack.org>; Fri, 01 Dec 2023 17:57:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1701482248; x=1702087048; darn=kvack.org;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to;
        bh=xrB9T/B1plr1KLo1ojkYN0bScrkWBvXt8AxTXe/RINw=;
        b=MzKZ2C/fMI5aF7FzRwbZOPzjXLnxCRVPunZxry9XRVRt2cVTuOujGDLT/+PpyUJIhL
         mdS14lxzNv597zAXOyK8vLtG/tzUTDluPsVr1lE1BDwynulPvRu3jAQl/JVh3VIHks6q
         Op/Jed+XhJn8bN6QMcjjsFWBqV3HCe3oArEY7NAmruYzhFAY79m6POxCG8OXE1zNdKm7
         DPpCD057M7fm4Nw/2FU/Vp/F5WdyKW1WmF9AqPITjorSh4h6sZ+gK1cLxaY4WY+Mx5SR
         pn3PahIqAg17zaRLLkaQV9vYtDesWZcwt1z8uvPVfKL0wHotUxW8BNKd+3K0fAoY4Lqp
         b1fg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1701482248; x=1702087048;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=xrB9T/B1plr1KLo1ojkYN0bScrkWBvXt8AxTXe/RINw=;
        b=NeWRwIBTiLQta6GDNxOSseQQYFXoHnQBsxZm7iSYF3lq4cF4h63i9tlLueOBEbohCT
         SpO3YPhSjhNEMkQ4trQCBOdAS1STXL6+oiT4EyVahgP3VGtZMYWphq9AkFQM378yjiGR
         Y2n3L4696zRfEOb8Aq1tZL+gQhlCKF2cuDc9nN1KcNljxMWXKtdVicqP6akvle6LKBq+
         dWsRRS9Lcyge4C21ViD761bNiyp+K1hxRvg+7p1cQ5AZ/NRnzkNQSSp3ilEGBySPa1/3
         BKNStwFZt4xivo9Fyawnl4wEjXoo1Y0pUpBIZMseAmAzl21v62TaqnGAg6LwvrNaarsp
         mNdg==
X-Gm-Message-State: AOJu0YzIZw7RcmQtrB8fhMyGSAoajLTUyOtUXtGZjTmZMKhm3Rq54R/1
	LuTCQhdJBq6DuwBib75yInE=
X-Google-Smtp-Source: AGHT+IEUg0dW+OF3gza2fK1oIx0lCbnofHvNwbjMv9gjLRLawu5IN0DUC+jUjFTWVHbV85RLDGqdkw==
X-Received: by 2002:a17:902:8e86:b0:1cf:ad5f:20ab with SMTP id bg6-20020a1709028e8600b001cfad5f20abmr539146plb.19.1701482247828;
        Fri, 01 Dec 2023 17:57:27 -0800 (PST)
Received: from archie.me ([103.131.18.64])
        by smtp.gmail.com with ESMTPSA id h1-20020a170902f54100b001b3bf8001a9sm3989813plf.48.2023.12.01.17.57.26
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 01 Dec 2023 17:57:26 -0800 (PST)
Received: by archie.me (Postfix, from userid 1000)
	id 726A810082367; Sat,  2 Dec 2023 08:57:24 +0700 (WIB)
Date: Sat, 2 Dec 2023 08:57:24 +0700
From: Bagas Sanjaya <bagasdotme@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	Ivan Babrou <ivan@cloudflare.com>, Tejun Heo <tj@kernel.org>,
	Michal =?utf-8?Q?Koutn=C3=BD?= <mkoutny@suse.com>,
	Waiman Long <longman@redhat.com>, kernel-team@cloudflare.com,
	Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>,
	Domenico Cerasuolo <cerasuolodomenico@gmail.com>,
	Attreyee M <tintinm2017@gmail.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Linux CGroups <cgroups@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [mm-unstable v4 5/5] mm: memcg: restore subtree stats flushing
Message-ID: <ZWqPBHCXz4nBIQFN@archie.me>
References: <20231129032154.3710765-1-yosryahmed@google.com>
 <20231129032154.3710765-6-yosryahmed@google.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="nOlWIuGpHWt+jozi"
Content-Disposition: inline
In-Reply-To: <20231129032154.3710765-6-yosryahmed@google.com>
X-Rspam-User: 
X-Rspamd-Server: rspam06
X-Rspamd-Queue-Id: 2A9D1180015
X-Stat-Signature: hb6goe9hh9xww15pkc6jcs6z7e4qaybr
X-HE-Tag: 1701482248-48040
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>


--nOlWIuGpHWt+jozi
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Nov 29, 2023 at 03:21:53AM +0000, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>=20
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>=20
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes the stats accuracy is compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed, yielding stale stats (by up to 2s due to periodic
>   flushing). This is more visible when reading stats from userspace,
>   but can also affect in-kernel flushers.
>=20
> The latter problem is particularly a concern when userspace reads stats
> after an event occurs, but gets stats from before the event. Examples:
> - When memory usage / pressure spikes, a userspace OOM handler may look
>   at the stats of different memcgs to select a victim based on various
>   heuristics (e.g. how much private memory will be freed by killing
>   this). Reading stale stats from before the usage spike in this case
>   may cause a wrongful OOM kill.
> - A proactive reclaimer may read the stats after writing to
>   memory.reclaim to measure the success of the reclaim operation. Stale
>   stats from before reclaim may give a false negative.
> - Reading the stats of a parent and a child memcg may be inconsistent
>   (child larger than parent), if the flush doesn't happen when the
>   parent is read, but happens when the child is read.
>=20
> As for in-kernel flushers, they will occasionally get stale stats. No
> regressions are currently known from this, but if there are regressions,
> they would be very difficult to debug and link to the source of the
> problem.
>=20
> This patch aims to fix these problems by restoring subtree flushing,
> and removing the unified/coalesced flushing logic that skips flushing if
> there is an ongoing flush. This change would introduce a significant
> regression with global stats flushing thresholds. With per-memcg stats
> flushing thresholds, this seems to perform really well. The thresholds
> protect the underlying lock from unnecessary contention.
>=20
> Add a mutex to protect the underlying rstat lock from excessive memcg
> flushing. The thresholds are re-checked after the mutex is grabbed to
> make sure that a concurrent flush did not already get the subtree we are
> trying to flush. A call to cgroup_rstat_flush() is not cheap, even if
> there are no pending updates.
>=20
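
As an illustration of the check / lock / re-check scheme described in the
paragraph above, here is a minimal userspace sketch. It is not the kernel
code: pending_updates, FLUSH_THRESHOLD, flush_count, note_updates() and
flush_if_needed() are hypothetical names, the threshold value is arbitrary,
and a pthread mutex stands in for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical threshold; the real one is computed by the kernel. */
#define FLUSH_THRESHOLD 64

static atomic_long pending_updates;	/* updates since the last flush */
static long flush_count;		/* real flushes, under flush_mutex */
static pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Record n new stat updates that have not been flushed yet. */
void note_updates(long n)
{
	assert(n > 0);
	atomic_fetch_add(&pending_updates, n);
}

static bool should_flush(void)
{
	return atomic_load(&pending_updates) > FLUSH_THRESHOLD;
}

/* Stand-in for the expensive flush; caller holds flush_mutex. */
static void do_flush(void)
{
	atomic_store(&pending_updates, 0);
	flush_count++;
}

void flush_if_needed(void)
{
	/* Cheap unlocked check: skip when there is too little to flush. */
	if (!should_flush())
		return;

	pthread_mutex_lock(&flush_mutex);
	/* Re-check: a concurrent caller may have flushed already. */
	if (should_flush())
		do_flush();
	pthread_mutex_unlock(&flush_mutex);
}
```

The unlocked first check keeps the common "nothing to flush" path cheap,
and the second check under the mutex avoids repeating a flush that a
concurrent caller has just completed.
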
> This patch was tested in two ways to ensure the latency of flushing is
> up to par, on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers in 500 cgroups doing
>   allocations and reclaim, as well as 1000 readers for memory.stat
>   (variation of [2]). No regressions were noticed in the total runtime.
>   Note that significant regressions in this test are observed with
>   global stats thresholds, but not with per-memcg thresholds.
>=20
> - A synthetic stress test for concurrently reading memcg stats while
>   memory allocation/freeing workers are running in the background,
>   provided by Wei Xu [3]. With 250k threads reading the stats every
>   100ms in 50k cgroups, 99.9% of reads take <=3D 50us. Less than 0.01%
>   of reads take more than 1ms, and no reads take more than 100ms.
>=20
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO=
9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZ=
cz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=3DiF5Lf_cRnKxUfkiEe0AMDTu6yh=
rUAzX0b6a6rDg@mail.gmail.com/
>=20
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> ---
>  include/linux/memcontrol.h |  8 ++--
>  mm/memcontrol.c            | 75 +++++++++++++++++++++++---------------
>  mm/vmscan.c                |  2 +-
>  mm/workingset.c            | 10 +++--
>  4 files changed, 58 insertions(+), 37 deletions(-)
>=20
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a568f70a26774..8673140683e6e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1050,8 +1050,8 @@ static inline unsigned long lruvec_page_state_local=
(struct lruvec *lruvec,
>  	return x;
>  }
> =20
> -void mem_cgroup_flush_stats(void);
> -void mem_cgroup_flush_stats_ratelimited(void);
> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> +void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
> =20
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item=
 idx,
>  			      int val);
> @@ -1566,11 +1566,11 @@ static inline unsigned long lruvec_page_state_loc=
al(struct lruvec *lruvec,
>  	return node_page_state(lruvec_pgdat(lruvec), idx);
>  }
> =20
> -static inline void mem_cgroup_flush_stats(void)
> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>  {
>  }
> =20
> -static inline void mem_cgroup_flush_stats_ratelimited(void)
> +static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup =
*memcg)
>  {
>  }
> =20
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 93b483b379aa1..5d300318bf18a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -670,7 +670,6 @@ struct memcg_vmstats {
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwor=
k);
> -static atomic_t stats_flush_ongoing =3D ATOMIC_INIT(0);
>  static u64 flush_last_time;
> =20
>  #define FLUSH_TIME (2UL*HZ)
> @@ -731,35 +730,47 @@ static inline void memcg_rstat_updated(struct mem_c=
group *memcg, int val)
>  	}
>  }
> =20
> -static void do_flush_stats(void)
> +static void do_flush_stats(struct mem_cgroup *memcg)
>  {
> -	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> -	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> -		return;
> -
> -	WRITE_ONCE(flush_last_time, jiffies_64);
> -
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	if (mem_cgroup_is_root(memcg))
> +		WRITE_ONCE(flush_last_time, jiffies_64);
> =20
> -	atomic_set(&stats_flush_ongoing, 0);
> +	cgroup_rstat_flush(memcg->css.cgroup);
>  }
> =20
> -void mem_cgroup_flush_stats(void)
> +/*
> + * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
> + * @memcg: root of the subtree to flush
> + *
> + * Flushing is serialized by the underlying global rstat lock. There is =
also a
> + * minimum amount of work to be done even if there are no stat updates t=
o flush.
> + * Hence, we only flush the stats if the updates delta exceeds a thresho=
ld. This
> + * avoids unnecessary work and contention on the underlying lock.
> + */

What is the global rstat lock?

> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>  {
> -	if (memcg_should_flush_stats(root_mem_cgroup))
> -		do_flush_stats();
> +	static DEFINE_MUTEX(memcg_stats_flush_mutex);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	if (!memcg)
> +		memcg =3D root_mem_cgroup;
> +
> +	if (memcg_should_flush_stats(memcg)) {
> +		mutex_lock(&memcg_stats_flush_mutex);
> +		/* Check again after locking, another flush may have occurred */
> +		if (memcg_should_flush_stats(memcg))
> +			do_flush_stats(memcg);
> +		mutex_unlock(&memcg_stats_flush_mutex);
> +	}
>  }
> =20
> -void mem_cgroup_flush_stats_ratelimited(void)
> +void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
>  {
>  	/* Only flush if the periodic flusher is one full cycle late */
>  	if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
> -		mem_cgroup_flush_stats();
> +		mem_cgroup_flush_stats(memcg);
>  }
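
For reference, the "one full cycle late" test in the hunk above can be
modelled in userspace as below. This is only a sketch under stated
assumptions: tick_after() mimics the wraparound-safe comparison that
time_after64() performs, while FLUSH_PERIOD (in arbitrary ticks) and
flusher_is_late() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flush period, in arbitrary ticks. */
#define FLUSH_PERIOD 2000u

static_assert(FLUSH_PERIOD > 0, "period must be non-zero");

/* True if tick a is later than tick b, even across a counter wrap. */
static bool tick_after(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* True once more than two full periods passed since the last flush. */
bool flusher_is_late(uint64_t now, uint64_t last_flush)
{
	return tick_after(now, last_flush + 2 * FLUSH_PERIOD);
}
```

With these values, a caller at tick 4000 after a flush at tick 0 is not
yet considered late, while a caller at tick 4001 is.
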
> =20
>  static void flush_memcg_stats_dwork(struct work_struct *w)
> @@ -768,7 +779,7 @@ static void flush_memcg_stats_dwork(struct work_struc=
t *w)
>  	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
>  	 * in latency-sensitive paths is as cheap as possible.
>  	 */
> -	do_flush_stats();
> +	do_flush_stats(root_mem_cgroup);
>  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>  }
> =20
> @@ -1664,7 +1675,7 @@ static void memcg_stat_format(struct mem_cgroup *me=
mcg, struct seq_buf *s)
>  	 *
>  	 * Current memory state:
>  	 */
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(memcg);
> =20
>  	for (i =3D 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		u64 size;
> @@ -4214,7 +4225,7 @@ static int memcg_numa_stat_show(struct seq_file *m,=
 void *v)
>  	int nid;
>  	struct mem_cgroup *memcg =3D mem_cgroup_from_seq(m);
> =20
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(memcg);
> =20
>  	for (stat =3D stats; stat < stats + ARRAY_SIZE(stats); stat++) {
>  		seq_printf(m, "%s=3D%lu", stat->name,
> @@ -4295,7 +4306,7 @@ static void memcg1_stat_format(struct mem_cgroup *m=
emcg, struct seq_buf *s)
> =20
>  	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) !=3D ARRAY_SIZE(memcg1_stats=
));
> =20
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(memcg);
> =20
>  	for (i =3D 0; i < ARRAY_SIZE(memcg1_stats); i++) {
>  		unsigned long nr;
> @@ -4791,7 +4802,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, =
unsigned long *pfilepages,
>  	struct mem_cgroup *memcg =3D mem_cgroup_from_css(wb->memcg_css);
>  	struct mem_cgroup *parent;
> =20
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(memcg);
> =20
>  	*pdirty =3D memcg_page_state(memcg, NR_FILE_DIRTY);
>  	*pwriteback =3D memcg_page_state(memcg, NR_WRITEBACK);
> @@ -6886,7 +6897,7 @@ static int memory_numa_stat_show(struct seq_file *m=
, void *v)
>  	int i;
>  	struct mem_cgroup *memcg =3D mem_cgroup_from_seq(m);
> =20
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(memcg);
> =20
>  	for (i =3D 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		int nid;
> @@ -8125,7 +8136,11 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  			break;
>  		}
> =20
> -		cgroup_rstat_flush(memcg->css.cgroup);
> +		/*
> +		 * mem_cgroup_flush_stats() ignores small changes. Use
> +		 * do_flush_stats() directly to get accurate stats for charging.
> +		 */
> +		do_flush_stats(memcg);
>  		pages =3D memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>  		if (pages < max)
>  			continue;
> @@ -8190,8 +8205,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *=
objcg, size_t size)
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>  			      struct cftype *cft)
>  {
> -	cgroup_rstat_flush(css->cgroup);
> -	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
> +	struct mem_cgroup *memcg =3D mem_cgroup_from_css(css);
> +
> +	mem_cgroup_flush_stats(memcg);
> +	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
>  }
> =20
>  static int zswap_max_show(struct seq_file *m, void *v)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d8c3338fee0fb..0b8a0107d58d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2250,7 +2250,7 @@ static void prepare_scan_control(pg_data_t *pgdat, =
struct scan_control *sc)
>  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
>  	 * lruvec stats for heuristics.
>  	 */
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats(sc->target_mem_cgroup);
> =20
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
> diff --git a/mm/workingset.c b/mm/workingset.c
> index dce41577a49d2..7d3dacab8451a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -464,8 +464,12 @@ bool workingset_test_recent(void *shadow, bool file,=
 bool *workingset)
> =20
>  	rcu_read_unlock();
> =20
> -	/* Flush stats (and potentially sleep) outside the RCU read section */
> -	mem_cgroup_flush_stats_ratelimited();
> +	/*
> +	 * Flush stats (and potentially sleep) outside the RCU read section.
> +	 * XXX: With per-memcg flushing and thresholding, is ratelimiting
> +	 * still needed here?
> +	 */
> +	mem_cgroup_flush_stats_ratelimited(eviction_memcg);

What if flushing is not rate-limited (e.g. the above line is commented out)?

> =20
>  	eviction_lruvec =3D mem_cgroup_lruvec(eviction_memcg, pgdat);
>  	refault =3D atomic_long_read(&eviction_lruvec->nonresident_age);
> @@ -676,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrink=
er *shrinker,
>  		struct lruvec *lruvec;
>  		int i;
> =20
> -		mem_cgroup_flush_stats();
> +		mem_cgroup_flush_stats(sc->memcg);
>  		lruvec =3D mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  		for (pages =3D 0, i =3D 0; i < NR_LRU_LISTS; i++)
>  			pages +=3D lruvec_page_state_local(lruvec,

Confused...

--=20
An old man doll... just what I always wanted! - Clara

--nOlWIuGpHWt+jozi
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iHUEABYKAB0WIQSSYQ6Cy7oyFNCHrUH2uYlJVVFOowUCZWqPAAAKCRD2uYlJVVFO
o679AQCaiziz0f+tw3jC9nOLLQDBlaTl8wi71FJT4Q7x3iXRXwD+JDR9uzJPnLAw
yBzdxmHDgkPi3OrCa7Gr1JO8CXG6RQY=
=B70i
-----END PGP SIGNATURE-----

--nOlWIuGpHWt+jozi--