Re: Advice on cgroup rstat lock

From: Shakeel Butt <shakeel.butt@linux.dev>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	Waiman Long <longman@redhat.com>,
	 Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	 Jesper Dangaard Brouer <jesper@cloudflare.com>,
	"David S. Miller" <davem@davemloft.net>,
	 Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Shakeel Butt <shakeelb@google.com>,
	 Arnaldo Carvalho de Melo <acme@kernel.org>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	 kernel-team <kernel-team@cloudflare.com>,
	cgroups@vger.kernel.org, Linux-MM <linux-mm@kvack.org>,
	 Netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	 Ivan Babrou <ivan@cloudflare.com>
Subject: Re: Advice on cgroup rstat lock
Date: Tue, 16 Apr 2024 11:41:15 -0700
Message-ID: <f6daabzdesdwo7zdouexow5mdub3qnzr7e67lonmhh3itjgk5j@qw3xpvqoyb7j>
In-Reply-To: <9f6333ec-f28c-4a91-b7b9-07a028d92225@kernel.org>

On Tue, Apr 16, 2024 at 04:22:51PM +0200, Jesper Dangaard Brouer wrote:

Sorry for the late response. I see there are patches posted as well,
which I will take a look at, but let me put some things in perspective.

> 
> 
> > 
> > I personally don't like mem_cgroup_flush_stats_ratelimited() very
> > much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> > and a lot of changes can happen in a very short amount of time.
> > However, it seems like for some workloads it's a necessary evil :/
> > 

Other than obj_cgroup_may_zswap(), there is no other place which really
needs very accurate stats. IMO we should actually make the ratelimited
version the default one for all the places. Stats will always be out of
sync for some time window even with a non-ratelimited flush, and I don't
see any place where a 2 second old stat would be an issue.
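
To make the ratelimited idea concrete, here is a rough userspace sketch,
not the kernel code. It assumes a periodic flusher that normally runs
every FLUSH_TIME (taken as 2 seconds, inferred from the "2*FLUSH_TIME
looks like 4 sec" remark in this thread) and models time in milliseconds
rather than jiffies. The names flush_state, do_flush() and
flush_ratelimited() are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: periodic flusher interval, in milliseconds (2 seconds). */
#define FLUSH_TIME_MS 2000u

/* Hypothetical model state; the kernel tracks this differently. */
struct flush_state {
	uint64_t last_flush_ms;	/* time of the most recent flush */
	unsigned long flushes;	/* number of expensive flushes performed */
};

/* Stand-in for the expensive flush that takes the global lock. */
static void do_flush(struct flush_state *st, uint64_t now_ms)
{
	st->last_flush_ms = now_ms;
	st->flushes++;
}

/*
 * Time-based rate limit: flush only if the last flush is more than
 * two flusher periods old, i.e. the periodic flusher is a full cycle
 * late. Returns true if a flush was performed.
 * Assumes now_ms never goes backwards relative to last_flush_ms.
 */
static bool flush_ratelimited(struct flush_state *st, uint64_t now_ms)
{
	if (now_ms - st->last_flush_ms > 2 * FLUSH_TIME_MS) {
		do_flush(st, now_ms);
		return true;
	}
	return false;
}
```

With these assumed values a caller on the ratelimited path never forces
a flush while the stats are at most 4 seconds old, and in the common
case the periodic flusher keeps them within about 2 seconds.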

> 
> I like the combination of the two mem_cgroup_flush_stats_ratelimited()
> and memcg_vmstats_needs_flush().
> IMHO the jiffies rate limit 2*FLUSH_TIME is too high, looks like 4 sec?

4 sec is the worst case, and I don't think anyone has seen or reported
a 4 sec delayed flush. If it is happening, it seems like no one cares.

> 
> 
> > I briefly looked into a global scheme similar to
> > memcg_vmstats_needs_flush() in core cgroups code, but I gave up
> > quickly. Different subsystems have different incomparable stats, so we
> > cannot have a simple magnitude of pending updates on a cgroup-level
> > that represents all subsystems fairly.
> > 
> > I tried to have per-subsystem callbacks to update the pending stats
> > and check if flushing is required -- but it got complicated quickly
> > and performance was bad.
> > 
> 
> I like the time-based limit because it doesn't require tracking pending
> updates.
> 
> I'm looking at using a time-based limit, on how often userspace can take
> the lock, but in the area of 50ms to 100 ms.

Sounds good to me. You might just need to check that
obj_cgroup_may_zswap() is not getting delayed or getting stale stats.
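
One way to picture that combination, purely as an illustrative
userspace sketch and not what the posted patches do: ordinary readers
skip the flush when one happened within the last 50 ms, while a caller
that needs accurate stats (the obj_cgroup_may_zswap() case) bypasses
the limit. The struct, the function and the 50 ms constant below are
all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: minimum spacing between reader-triggered flushes. */
#define READER_FLUSH_INTERVAL_MS 50u

/* Hypothetical model of the state guarded by the global lock. */
struct rstat_model {
	uint64_t last_flush_ms;		/* time of the most recent flush */
	bool ever_flushed;		/* false until the first flush */
	unsigned long lock_acquisitions;/* times the "global lock" was taken */
};

/*
 * Flush unless a flush happened very recently. Callers that cannot
 * tolerate stale stats pass need_accurate = true and always flush.
 * Returns true if a flush (and thus a lock acquisition) happened.
 * Assumes now_ms never goes backwards relative to last_flush_ms.
 */
static bool rstat_model_flush(struct rstat_model *m, uint64_t now_ms,
			      bool need_accurate)
{
	bool recent = m->ever_flushed &&
		      now_ms - m->last_flush_ms < READER_FLUSH_INTERVAL_MS;

	if (recent && !need_accurate)
		return false;	/* serve existing counters, skip the lock */

	m->lock_acquisitions++;	/* stands in for lock + flush + unlock */
	m->last_flush_ms = now_ms;
	m->ever_flushed = true;
	return true;
}
```

In this model many parallel readers within one 50 ms window cost a
single lock acquisition, and the accuracy-sensitive caller is never
delayed by the limit.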

> 
> 
> With a mutex lock contention will be less obvious, as converting this to
> a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention.
> 

I don't like global sleepable locks, as those are a source of priority
inversion issues on highly utilized multi-tenant systems, but I still
need to see how you are handling that.

> Userspace can easily trigger pressure on the global cgroup_rstat_lock
> by simply reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
> I think we need a system to mitigate lock contention from userspace
> (waiting on code compiling with a proposal).  We see normal userspace
> stats tools like cadvisor, nomad (and systemd) trigger this by reading
> all the stat files on the system and even spawning parallel threads
> without realizing that kernel side they share the same global lock.
> 
> You have done a huge effort to mitigate lock contention from memcg,
> thank you for that.  It would be sad if userspace reading these stat
> files can block memcg.  On production I see shrink_node having a
> congestion point happening on this global lock.

Seems like another instance where we should use the ratelimited version
of the flush function.
