Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosryahmed@google.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
	tj@kernel.org, cgroups@vger.kernel.org,  hannes@cmpxchg.org,
	lizefan.x@bytedance.com, longman@redhat.com,
	 kernel-team@cloudflare.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
Date: Tue, 25 Jun 2024 09:00:03 -0700	[thread overview]
Message-ID: <CAJD7tkZ0ReOjoioACyxQ848qNMh6a93hH616jNPgX3j72thrLg@mail.gmail.com> (raw)
In-Reply-To: <d3b5f10a-2649-446c-a6f9-9311f96e7569@kernel.org>

[..]
> >
> > Basically, I prefer that we don't skip flushing at all and keep
> > userspace and in-kernel users the same. We can use completions to make
> > other overlapping flushers sleep instead of spin on the lock.
> >
>
> I think there are good reasons for skipping flushes for userspace when
> reading these stats. More below.
>
> I'm looking at kernel code to spot cases where the flush MUST to be
> completed before returning.  There are clearly cases where we don't need
> 100% accurate stats, evident by mem_cgroup_flush_stats_ratelimited() and
> mem_cgroup_flush_stats() that use memcg_vmstats_needs_flush().
>
> The cgroup_rstat_exit() call seems to depend on cgroup_rstat_flush()
> being strict/accurate, because need to free the percpu resources.

Yeah I think this one cannot be skipped.

>
> The obj_cgroup_may_zswap() have a comments that says it needs to get
> accurate stats for charging.

This one needs to be somewhat accurate to respect memcg limits. I am
not sure how much inaccuracy we can tolerate.

>
> These were the two cases, I found, do you know of others?

Nothing that screamed at me, but as I mentioned, the non-deterministic
nature of this makes me uncomfortable and feels to me like a potential
way to get subtle bugs.

>
>
> > A proof of concept is basically something like:
> >
> > void cgroup_rstat_flush(cgroup)
> > {
> >      if (cgroup_is_descendant(cgroup, READ_ONCE(cgroup_under_flush))) {
> >          wait_for_completion_interruptible(&cgroup_under_flush->completion);
> >          return;
> >      }
>
> This feels like what we would achieve by changing this to a mutex.

The main difference is that whoever is holding the lock still cannot
sleep, while waiters can (and more importantly, they don't disable
interrupts). This is essentially a middle ground between a mutex and a
lock. I think this dodges the priority inversion problem Shakeel
described because a low priority job holding the lock cannot sleep.

Is there an existing locking primitive that can achieve this?

>
> >
> >      __cgroup_rstat_lock(cgrp, -1);
> >      reinit_completion(&cgroup->completion);
> >      /* Any overlapping flush requests after this write will not spin
> > on the lock */
> >      WRITE_ONCE(cgroup_under_flush, cgroup);
> >
> >      cgroup_rstat_flush_locked(cgrp);
> >      complete_all(&cgroup->completion);
> >      __cgroup_rstat_unlock(cgrp, -1);
> > }
> >
> > There may be missing barriers or chances to reduce the window between
> > __cgroup_rstat_lock and WRITE_ONCE(), but that's what I have in mind.
> > I think it's not too complicated, but we need to check if it fixes the
> > problem.
> >
> > If this is not preferable, then yeah, let's at least keep the
> > userspace behavior intact. This makes sure we don't affect userspace
> > negatively, and we can change it later as we please.
>
> I don't think userspace reading these stats need to be 100% accurate.
> We are only reading the io.stat, memory.stat and cpu.stat every 53
> seconds. Reading cpu.stat print stats divided by NSEC_PER_USEC (1000).
>
> If userspace is reading these very often, then they will be killing the
> system as it disables IRQs.
>
> On my prod system the flush of root cgroup can take 35 ms, which is not
> good, but this inaccuracy should not matter for userspace.
>
> Please educate me on why we need accurate userspace stats?

My point is not about accuracy, although I think it's a reasonable
argument on its own (a lot of things could change in a short amount of
time, which is why I prefer magnitude-based ratelimiting).

My point is about logical ordering. If a userspace program reads the
stats *after* an event occurs, it expects to get a snapshot of the
system state after that event. Two examples are:

- A proactive reclaimer reading the stats after a reclaim attempt to
check if it needs to reclaim more memory or fallback.
- A userspace OOM killer reading the stats after a usage spike to
decide which workload to kill.

I listed such examples with more detail in [1], when I removed
stats_flush_ongoing from the memcg code.

[1]https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@google.com/

next prev parent reply	other threads:[~2024-06-25 16:00 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-24 11:55 Jesper Dangaard Brouer
2024-06-24 12:46 ` Yosry Ahmed
2024-06-24 17:32   ` Shakeel Butt
2024-06-24 17:40     ` Yosry Ahmed
2024-06-24 19:29       ` Shakeel Butt
2024-06-24 19:37         ` Yosry Ahmed
2024-06-24 20:18           ` Shakeel Butt
2024-06-24 21:43             ` Yosry Ahmed
2024-06-24 22:17               ` Shakeel Butt
     [not found]                 ` <CAJD7tka0b52zm=SjqxO-gxc0XTib=81c7nMx9MFNttwVkCVmSg@mail.gmail.com>
2024-06-25  0:24                   ` Shakeel Butt
     [not found]                     ` <CAJD7tkaMeevj2TS_aRj_WXVi26CuuBrprYwUfQmszJnwqqJrHw@mail.gmail.com>
2024-06-25 15:32                       ` Jesper Dangaard Brouer
2024-06-25 16:00                         ` Yosry Ahmed [this message]
2024-06-25 16:21                           ` Shakeel Butt
2024-06-25 20:45                             ` Yosry Ahmed
2024-06-25 21:20                               ` Shakeel Butt
2024-06-25 21:24                                 ` Yosry Ahmed
2024-06-25 22:35                                   ` Christoph Lameter (Ampere)
2024-06-25 22:59                                     ` Yosry Ahmed
2024-06-26 21:35                                       ` Jesper Dangaard Brouer
2024-06-26 22:07                                         ` Yosry Ahmed
2024-06-27  9:21                                           ` Jesper Dangaard Brouer
2024-06-27 10:36                                             ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJD7tkZ0ReOjoioACyxQ848qNMh6a93hH616jNPgX3j72thrLg@mail.gmail.com \
    --to=yosryahmed@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hawk@kernel.org \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=longman@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox