From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 5 Sep 2023 08:55:56 -0700
Subject: Re: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
To: Michal Hocko
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, Sep 4, 2023 at 7:44 AM Michal Hocko wrote:
>
> On Thu 31-08-23 16:56:08, Yosry Ahmed wrote:
> > Most contexts that flush memcg stats use "unified" flushing, where
> > basically all flushers attempt to flush the entire hierarchy, but only
> > one flusher is allowed at a time, others skip flushing.
> >
> > This is needed because we need to flush the stats from paths such as
> > reclaim or refaults, which may have high concurrency, especially on
> > large systems. Serializing such performance-sensitive paths can
> > introduce regressions, hence, unified flushing offers a tradeoff between
> > stats staleness and the performance impact of flushing stats.
> >
> > Document this properly and explicitly by renaming the common flushing
> > helper from do_flush_stats() to do_unified_stats_flush(), and adding
> > documentation to describe unified flushing. Additionally, rename
> > flushing APIs to add "try" in the name, which implies that flushing will
> > not always happen. Also add proper documentation.
> >
> > No functional change intended.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> No objections to renaming but it would be really nice to simplify this.
> It is just "funny" to see 4 different flushing methods (flush from
> userspace, flush, try_flush_with_ratelimit_1 and try_flush_with_ratelimit_2).
> This is all internal so I am not all that worried that this would get
> confused but it surely is rather convoluted.
>
> Acked-by: Michal Hocko

Thanks! I tried to at least keep the naming consistent and add
documentation, but I agree we can do better :) It's less obvious to me
tbh where we can improve, but hopefully when someone new to this code
comes complaining we'll know better what needs to be changed.

>
> > ---
> >  include/linux/memcontrol.h |  8 ++---
> >  mm/memcontrol.c            | 61 +++++++++++++++++++++++++-------------
> >  mm/vmscan.c                |  2 +-
> >  mm/workingset.c            |  4 +--
> >  4 files changed, 47 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 11810a2cfd2d..d517b0cc5221 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >  	return x;
> >  }
> >
> > -void mem_cgroup_flush_stats(void);
> > -void mem_cgroup_flush_stats_ratelimited(void);
> > +void mem_cgroup_try_flush_stats(void);
> > +void mem_cgroup_try_flush_stats_ratelimited(void);
> >
> >  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  			      int val);
> > @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >  	return node_page_state(lruvec_pgdat(lruvec), idx);
> >  }
> >
> > -static inline void mem_cgroup_flush_stats(void)
> > +static inline void mem_cgroup_try_flush_stats(void)
> >  {
> >  }
> >
> > -static inline void mem_cgroup_flush_stats_ratelimited(void)
> > +static inline void mem_cgroup_try_flush_stats_ratelimited(void)
> >  {
> >  }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index cf57fe9318d5..2d0ec828a1c4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> > +static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> >
> > @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  		/*
> >  		 * If stats_flush_threshold exceeds the threshold
> >  		 * (>num_online_cpus()), cgroup stats update will be triggered
> > -		 * in __mem_cgroup_flush_stats(). Increasing this var further
> > +		 * in mem_cgroup_try_flush_stats(). Increasing this var further
> >  		 * is redundant and simply adds overhead in atomic update.
> >  		 */
> >  		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
> > @@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  	}
> >  }
> >
> > -static void do_flush_stats(void)
> > +/*
> > + * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > + *
> > + * A unified flush tries to flush the entire hierarchy, but skips if there is
> > + * another ongoing flush. This is meant for flushers that may have a lot of
> > + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
> > + * avoid slowing down performance-sensitive paths. A unified flush may skip, and
> > + * hence may yield stale stats.
> > + */
> > +static void do_unified_stats_flush(void)
> >  {
> > -	/*
> > -	 * We always flush the entire tree, so concurrent flushers can just
> > -	 * skip. This avoids a thundering herd problem on the rstat global lock
> > -	 * from memcg flushers (e.g. reclaim, refault, etc).
> > -	 */
> > -	if (atomic_read(&stats_flush_ongoing) ||
> > -	    atomic_xchg(&stats_flush_ongoing, 1))
> > +	if (atomic_read(&stats_unified_flush_ongoing) ||
> > +	    atomic_xchg(&stats_unified_flush_ongoing, 1))
> >  		return;
> >
> >  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> > @@ -655,19 +659,34 @@ static void do_flush_stats(void)
> >  	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> >
> >  	atomic_set(&stats_flush_threshold, 0);
> > -	atomic_set(&stats_flush_ongoing, 0);
> > +	atomic_set(&stats_unified_flush_ongoing, 0);
> >  }
> >
> > -void mem_cgroup_flush_stats(void)
> > +/*
> > + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
> > + *
> > + * Try to flush the stats of all memcgs that have stat updates since the last
> > + * flush. We do not flush the stats if:
> > + * - The magnitude of the pending updates is below a certain threshold.
> > + * - There is another ongoing unified flush (see do_unified_stats_flush()).
> > + *
> > + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
> > + * periodic flushing.
> > + */
> > +void mem_cgroup_try_flush_stats(void)
> >  {
> >  	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> > -		do_flush_stats();
> > +		do_unified_stats_flush();
> >  }
> >
> > -void mem_cgroup_flush_stats_ratelimited(void)
> > +/*
> > + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
> > + * is late.
> > + */
> > +void mem_cgroup_try_flush_stats_ratelimited(void)
> >  {
> >  	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > -		mem_cgroup_flush_stats();
> > +		mem_cgroup_try_flush_stats();
> >  }
> >
> >  static void flush_memcg_stats_dwork(struct work_struct *w)
> > @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
> >  	 * Always flush here so that flushing in latency-sensitive paths is
> >  	 * as cheap as possible.
> >  	 */
> > -	do_flush_stats();
> > +	do_unified_stats_flush();
> >  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
> >  }
> >
> > @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> >  	 *
> >  	 * Current memory state:
> >  	 */
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >  		u64 size;
> > @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
> >  	int nid;
> >  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> >
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> >  		seq_printf(m, "%s=%lu", stat->name,
> > @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> >
> >  	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
> >
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> >  		unsigned long nr;
> > @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
> >  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> >  	struct mem_cgroup *parent;
> >
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> >  	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> > @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
> >  	int i;
> >  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> >
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >  		int nid;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c7c149cb8d66..457a18921fda 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> >  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
> >  	 * lruvec stats for heuristics.
> >  	 */
> > -	mem_cgroup_flush_stats();
> > +	mem_cgroup_try_flush_stats();
> >
> >  	/*
> >  	 * Determine the scan balance between anon and file LRUs.
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index da58a26d0d4d..affb8699e58d 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
> >  	}
> >
> >  	/* Flush stats (and potentially sleep) before holding RCU read lock */
> > -	mem_cgroup_flush_stats_ratelimited();
> > +	mem_cgroup_try_flush_stats_ratelimited();
> >
> >  	rcu_read_lock();
> >
> > @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  		struct lruvec *lruvec;
> >  		int i;
> >
> > -		mem_cgroup_flush_stats();
> > +		mem_cgroup_try_flush_stats();
> >  		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> >  			pages += lruvec_page_state_local(lruvec,
> > --
> > 2.42.0.rc2.253.gd59a3bf2b4-goog
>
> --
> Michal Hocko
> SUSE Labs
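
[Editor's note: for readers outside the kernel tree, the following is a minimal
userspace sketch of the skip-if-ongoing pattern that do_unified_stats_flush()
uses in the patch above. It substitutes C11 atomics and a stub function for the
kernel's atomic_t API and cgroup_rstat_flush(); all names here are illustrative
only, not part of the patch.]

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Mirrors stats_unified_flush_ongoing: nonzero while a flush runs. */
    static atomic_int flush_ongoing;

    /* Stand-in for flushing the whole hierarchy (cgroup_rstat_flush()). */
    static void flush_everything(void)
    {
            puts("flushing all stats");
    }

    /*
     * Every caller asks to flush everything, but only one caller does the
     * work at a time; the rest skip and tolerate slightly stale stats.
     */
    static bool try_unified_flush(void)
    {
            /* Cheap read first, then claim the slot with one exchange. */
            if (atomic_load(&flush_ongoing) ||
                atomic_exchange(&flush_ongoing, 1))
                    return false;   /* another flusher is ongoing; skip */

            flush_everything();
            atomic_store(&flush_ongoing, 0);
            return true;
    }

    int main(void)
    {
            printf("flushed: %d\n", try_unified_flush());
            return 0;
    }

The plain load before the exchange matters under contention: most racing
callers bail out on the read alone instead of all hammering the same cache
line with read-modify-write operations, which is the "thundering herd"
concern the original in-function comment described.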