From: Shakeel Butt <shakeel.butt@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Yosry Ahmed <yosryahmed@google.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
Yu Zhao <yuzhao@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Meta kernel team <kernel-team@meta.com>,
cgroups@vger.kernel.org
Subject: [PATCH v2] memcg: use ratelimited stats flush in the reclaim
Date: Tue, 13 Aug 2024 14:53:58 -0700
Message-ID: <20240813215358.2259750-1-shakeel.butt@linux.dev>
Meta production is seeing a large number of stalls in memcg stats flush
from the memcg reclaim code path. At the moment, this specific callsite
does a synchronous memcg stats flush. The rstat flush is an expensive
and time-consuming operation, so concurrent reclaimers will busy-wait on
the lock, potentially for a long time. This issue is not unique to Meta
and has been observed by Cloudflare [1] as well. In the Cloudflare case,
the stalls were due to contention between kswapd threads running on
their 8 NUMA node machines. That contention is unnecessary, since the
rstat flush is global and a flush from one kswapd thread should be
sufficient for all. Simply replace the synchronous flush with the
ratelimited one.
One may raise a concern about using stats that are up to 2 seconds stale
(at worst) for heuristics such as the desirable inactive:active ratio and
the preference for inactive file pages over anon pages. However, these
specific heuristics do not require very precise stats and are also
ignored under severe memory pressure.
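To make the staleness trade-off concrete, here is a minimal userspace
sketch of a rate-limited flush. It is a model under stated assumptions,
not the kernel implementation: struct stats_model, model_flush() and
model_flush_ratelimited() are hypothetical names, and the max_age_ms
threshold is a free parameter rather than the kernel's actual value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model state; these are not the kernel's data structures. */
struct stats_model {
	uint64_t last_flush_ms;	/* time of the most recent flush */
	unsigned long flushes;	/* number of expensive flushes performed */
};

/* Stand-in for the expensive synchronous flush. */
static void model_flush(struct stats_model *m, uint64_t now_ms)
{
	m->last_flush_ms = now_ms;
	m->flushes++;
}

/*
 * Rate-limited variant: flush only when the stats are older than
 * max_age_ms. Returns true if a flush was actually performed.
 */
static bool model_flush_ratelimited(struct stats_model *m, uint64_t now_ms,
				    uint64_t max_age_ms)
{
	assert(now_ms >= m->last_flush_ms);
	if (now_ms - m->last_flush_ms <= max_age_ms)
		return false;	/* recent enough, skip the expensive work */
	model_flush(m, now_ms);
	return true;
}
```

In this model, any number of concurrent callers arriving within
max_age_ms of the last flush return immediately and read slightly stale
stats instead of waiting on the flush.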
More specifically for this code path, the stats are needed for two
specific heuristics:
1. Deactivate LRUs
2. Cache trim mode
The deactivate LRUs heuristic maintains a desirable inactive:active
ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
and the hierarchical LRU size. WORKINGSET_ACTIVATE* is needed to check
whether there has been a refault since the last snapshot, and the LRU
sizes are needed for the desirable ratio between the inactive and active
LRUs. See the table below on how the desirable ratio is calculated.
/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */
The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
LRU size information to calculate this ratio. In addition, if
deactivation is skipped for some LRU, the kernel will force
deactivation under severe memory pressure.
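As an illustration of why coarse LRU sizes suffice, the sketch below
reproduces the target-ratio column of the table in userspace C. It
assumes the rule behind the table is the integer square root of
10 x (total size in GiB), with a ratio of 1 below 1 GiB; the function
names and the byte-based interface are choices made for this example,
not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Integer square root (floor), built bit by bit. */
static uint64_t isqrt_u64(uint64_t x)
{
	uint64_t res = 0;
	uint64_t bit = (uint64_t)1 << 62;

	while (bit > x)
		bit >>= 2;
	while (bit != 0) {
		if (x >= res + bit) {
			x -= res + bit;
			res = (res >> 1) + bit;
		} else {
			res >>= 1;
		}
		bit >>= 2;
	}
	return res;
}

/*
 * Target inactive:active ratio for LRUs totalling total_bytes.
 * Only the whole-GiB part of the size influences the result.
 */
static uint64_t inactive_ratio(uint64_t total_bytes)
{
	uint64_t gib = total_bytes >> 30;

	return gib ? isqrt_u64(10 * gib) : 1;
}

/* True when the inactive list is too small relative to the active list. */
static bool inactive_is_low(uint64_t inactive_bytes, uint64_t active_bytes)
{
	uint64_t ratio = inactive_ratio(inactive_bytes + active_bytes);

	assert(inactive_bytes <= UINT64_MAX / ratio);	/* no overflow */
	return inactive_bytes * ratio < active_bytes;
}
```

Because the size is truncated to whole GiB before the square root, a
stats error of a few hundred MiB on a 10 GiB LRU pair leaves the ratio
at 10 in this sketch.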
For the cache trim mode, the kernel reads the inactive file LRU size,
scales it down based on the reclaim iteration (file >> sc->priority) and
only checks whether the result is zero. Again, precise information is
not needed.
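The shift-and-test described above can be sketched as follows. This is
a simplified userspace model: cache_trim_possible() is a hypothetical
helper, MODEL_DEF_PRIORITY (12) is an assumed starting priority, and any
other conditions the kernel applies before enabling cache trim mode are
left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DEF_PRIORITY 12	/* assumed starting reclaim priority */

/*
 * Scale the inactive file LRU size (in pages) down by the reclaim
 * priority and report only whether anything is left.
 */
static bool cache_trim_possible(unsigned long inactive_file_pages,
				int priority)
{
	assert(priority >= 0 && priority <= MODEL_DEF_PRIORITY);
	return (inactive_file_pages >> priority) != 0;
}
```

At priority 12 the result flips only around 4096 pages, so in this
model a stale count of 100000 pages versus an actual 101000 gives the
same answer.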
This patch has been running on the Meta fleet for several months and we
have not observed any issues. Please note that MGLRU is not impacted by
this issue at all, as it avoids rstat flushing completely.
Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
Changes since v1:
- Updated the commit message.
mm/vmscan.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 008b62abf104..82318464cd5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2282,10 +2282,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
+	 * Flush the memory cgroup stats in rate-limited way as we don't need
+	 * most accurate stats here. We may switch to regular stats flushing
+	 * in the future once it is cheap enough.
 	 */
-	mem_cgroup_flush_stats(sc->target_mem_cgroup);
+	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
--
2.43.5