From: Yosry Ahmed <yosryahmed@google.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>,
tj@kernel.org, cgroups@vger.kernel.org, hannes@cmpxchg.org,
lizefan.x@bytedance.com, longman@redhat.com,
kernel-team@cloudflare.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
Date: Mon, 24 Jun 2024 14:43:02 -0700
Message-ID: <CAJD7tkZT_2tyOFq5koK0djMXj4tY8BO3CtSamPb85p=iiXCgXQ@mail.gmail.com>
In-Reply-To: <a45ggqu6jcve44y7ha6m6cr3pcjc3xgyomu4ml6jbsq3zv7tte@oeovgtwh6ytg>

On Mon, Jun 24, 2024 at 1:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 12:37:30PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 12:29 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:40:48AM GMT, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 10:32 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 05:46:05AM GMT, Yosry Ahmed wrote:
> > > > > > On Mon, Jun 24, 2024 at 4:55 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > > > > >
> > > > > [...]
> > > > > > I am assuming this supersedes your other patch titled "[PATCH RFC]
> > > > > > cgroup/rstat: avoid thundering herd problem on root cgrp", so I will
> > > > > > only respond here.
> > > > > >
> > > > > > I have two comments:
> > > > > > - There is no reason why this should be limited to the root cgroup. We
> > > > > > can keep track of the cgroup being flushed, and use
> > > > > > cgroup_is_descendant() to find out if the cgroup we want to flush is a
> > > > > > descendant of it. We can use a pointer and cmpxchg primitives instead
> > > > > > of the atomic here IIUC.
> > > > > >
> > > > > > - More importantly, I am not a fan of skipping the flush if there is
> > > > > > an ongoing one. For all we know, the ongoing flush could have just
> > > > > > started and the stats have not been flushed yet. This is another
> > > > > > example of non deterministic behavior that could be difficult to
> > > > > > debug.
> > > > >
> > > > > > Even with the flush, there will almost always be per-cpu updates which
> > > > > > will be missed. This cannot be fixed unless we block the stats updaters
> > > > > > as well (which is not going to happen). So, we are already ok with this
> > > > > > level of non-determinism. Why would skipping the flush be worse? One may
> > > > > > argue 'time window is smaller' but this still does not cap the amount of
> > > > > > updates. So, unless there is concrete data that skipping the flush
> > > > > > is detrimental to the users of stats, I don't see an issue in the
> > > > > > presence of the periodic flusher.
> > > >
> > > > As you mentioned, the updates that happen during the flush are
> > > > unavoidable anyway, and the window is small. On the other hand, we
> > > > should be able to maintain the current behavior that at least all the
> > > > stat updates that happened *before* the call to cgroup_rstat_flush()
> > > > are flushed after the call.
> > > >
> > > > The main concern here is that the stats read *after* an event occurs
> > > > should reflect the system state at that time. For example, a proactive
> > > > reclaimer reading the stats after writing to memory.reclaim should
> > > > observe the system state after the reclaim operation happened.
> > >
> > > What about the in-kernel users like kswapd? I don't see any before or
> > > after events for the in-kernel users.
> >
> > The example I can think of off the top of my head is the cache trim
> > mode scenario I mentioned when discussing your patch (i.e. not
> > realizing that file memory had already been reclaimed).
>
> Kswapd has some kind of cache trim failure mode where it decides to skip
> the cache trim heuristic. Also, for global reclaim there are a couple more
> conditions in play as well.

I was mostly concerned about entering cache trim mode when we
shouldn't, not vice versa, as I explained in the other thread. Anyway,
I think the problem of stats missing the updates from events that have
already happened is more pronounced with userspace reads.

>
> > There is also
> > a heuristic in zswap that may write back more (or fewer) pages than it
> > should to the swap device if the stats are significantly stale.
> >
>
> Is this the ratio of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED in
> zswap_shrinker_count()? There is already a target memcg flush in that
> function and I don't expect a root memcg flush from there.

I was thinking of the generic approach I suggested, where we can avoid
contending on the lock if the cgroup is a descendant of the cgroup
being flushed, regardless of whether or not it's the root memcg. I
think this would be more beneficial than just focusing on root
flushes.
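
To make the generic approach concrete, here is a minimal userspace sketch in
C11, not kernel code and not taken from any posted patch. All names
(model_cgroup, ongoing_flusher, model_flush_*) are hypothetical;
model_is_descendant() only mirrors the semantics of the kernel's
cgroup_is_descendant(), which also returns true when both arguments are the
same cgroup. It shows a pointer updated with cmpxchg that records which
cgroup is being flushed, plus the descendant check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup: only the parent link matters here. */
struct model_cgroup {
	struct model_cgroup *parent;	/* NULL for the root */
};

/* The cgroup whose subtree is currently being flushed, or NULL if none. */
static _Atomic(struct model_cgroup *) ongoing_flusher;

/* Walk up the parent links; a cgroup counts as its own descendant. */
bool model_is_descendant(const struct model_cgroup *cg,
			 const struct model_cgroup *ancestor)
{
	for (; cg; cg = cg->parent)
		if (cg == ancestor)
			return true;
	return false;
}

/* Try to become the ongoing flusher: cmpxchg NULL -> cg. */
bool model_flush_try_begin(struct model_cgroup *cg)
{
	struct model_cgroup *expected = NULL;

	return atomic_compare_exchange_strong(&ongoing_flusher, &expected, cg);
}

/* True if an ongoing flush already covers cg's subtree. */
bool model_flush_covered(const struct model_cgroup *cg)
{
	const struct model_cgroup *cur = atomic_load(&ongoing_flusher);

	return cur && model_is_descendant(cg, cur);
}

/* Clear the ongoing flusher: cmpxchg cg -> NULL. Only the owner may call this. */
void model_flush_end(struct model_cgroup *cg)
{
	struct model_cgroup *expected = cg;
	struct model_cgroup *none = NULL;
	bool cleared;

	cleared = atomic_compare_exchange_strong(&ongoing_flusher, &expected, none);
	assert(cleared);
	(void)cleared;	/* silence unused warning when NDEBUG is defined */
}
```

What a caller does when model_flush_covered() returns true is deliberately
left out. Given my concern above about skipping, I would expect it to wait
for the ongoing flush to finish rather than return right away, but that is
a separate design decision from the tracking shown here.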