From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <7cd05fac-9d93-45ca-aa15-afd1a34329c6@kernel.org>
 <20240319154437.GA144716@cmpxchg.org> <56556042-5269-4c7e-99ed-1a1ab21ac27f@kernel.org>
 <CAJD7tkYbO7MdKUBsaOiSp6-qnDesdmVsTCiZApN_ncS3YkDqGQ@mail.gmail.com>
 <bf94f850-fab4-4171-8dfe-b19ada22f3be@kernel.org> <CAJD7tkbn-wFEbhnhGWTy0-UsFoosr=m7wiJ+P96XnDoFnSH7Zg@mail.gmail.com>
 <ac4cf07f-52dd-454f-b897-2a4b3796a4d9@kernel.org> <96728c6d-3863-48c7-986b-b0b37689849e@redhat.com>
 <CAJD7tkZrVjhe5PPUZQNoAZ5oOO4a+MZe283MVTtQHghGSxAUnA@mail.gmail.com>
 <4fd9106c-40a6-415a-9409-c346d7ab91ce@redhat.com> <f72ab971-989e-4a1c-9246-9b8e57201b60@kernel.org>
In-Reply-To: <f72ab971-989e-4a1c-9246-9b8e57201b60@kernel.org>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 11 Apr 2024 10:22:57 -0700
Message-ID: <CAJD7tka=1AnBNFn=frp7AwfjGsZMGcDjw=xiWeqNygC5rPf6uQ@mail.gmail.com>
Subject: Re: Advice on cgroup rstat lock
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Waiman Long <longman@redhat.com>, Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>, 
	Jesper Dangaard Brouer <jesper@cloudflare.com>, "David S. Miller" <davem@davemloft.net>, 
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Shakeel Butt <shakeelb@google.com>, 
	Arnaldo Carvalho de Melo <acme@kernel.org>, Daniel Bristot de Oliveira <bristot@redhat.com>, 
	kernel-team <kernel-team@cloudflare.com>, cgroups@vger.kernel.org, 
	Linux-MM <linux-mm@kvack.org>, Netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>, 
	LKML <linux-kernel@vger.kernel.org>, Ivan Babrou <ivan@cloudflare.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

[..]
> >>>>
> >>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
> >>>
> >>> The cgroup_rstat_lock was originally a mutex. It was converted to a
> >>> spinlock in commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with
> >>> a spinlock"). Irq was disabled to enable calling from atomic context.
> >>> Since commit 0a2dc6ac3329 ("cgroup: remove
> >>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
> >>> atomic context anymore. Theoretically, we could change it back to a
> >>> mutex or not disabling interrupt. That will require that the API cannot
> >>> be called from atomic context going forward.
> >>>
> >> I think we should avoid flushing from atomic contexts going forward
> >> anyway tbh. It's just too much work to do with IRQs disabled, and we
> >> observed hard lockups before in worst case scenarios.
> >>
>
> Appreciate the historic commits as documentation for how the code
> evolved.  Sounds like we agree that the IRQ-disable can be lifted,
> at least between the three of us.

It can be lifted, but whether it should be or not is a different
story. I tried keeping it as a spinlock without disabling IRQs before
and Tejun pointed out possible problems, see below.

>
> >> I think one problem that was discussed before is that flushing is
> >> exercised from multiple contexts and could have very high concurrency
> >> (e.g. from reclaim when the system is under memory pressure). With a
> >> mutex, the flusher could sleep with the mutex held and block other
> >> threads for a while.
> >>
>
> Fair point, so in first iteration we keep the spin_lock but don't do the
> IRQ disable.

I tried doing that before, and Tejun had some objections:
https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/

My read of that thread is that Tejun would prefer we look into
converting cgroup_rstat_lock into a mutex again, or more aggressively
drop the lock on CPU boundaries. Perhaps we can unconditionally drop
the lock on each CPU boundary, but I am worried that contending the
lock too often may be an issue, which is why I suggested dropping the
lock if there are pending IRQs instead -- but I am not sure how to do
that :)
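
To make the trade-off concrete, here is a rough userspace sketch (not
kernel code) that only counts how often the lock would be dropped under
each policy. The names yield_policy and flush_lock_drops() are made up
for illustration, and the irq_pending[] array is a stand-in for an
"are IRQs pending?" check; the sketch does not assume any such kernel
helper exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies for releasing the lock between per-CPU iterations. */
enum yield_policy {
    YIELD_NEVER,        /* hold the lock across all CPUs */
    YIELD_ALWAYS,       /* drop and retake the lock at every CPU boundary */
    YIELD_IF_PENDING,   /* drop only when an IRQ is (pretend) pending */
};

/*
 * Simulate flushing nr_cpus CPUs and return how many times the lock would
 * be dropped and retaken. irq_pending[cpu] is an assumed input telling us
 * whether an IRQ is pending at the boundary after flushing that CPU.
 */
static size_t flush_lock_drops(enum yield_policy policy,
                               const bool *irq_pending, size_t nr_cpus)
{
    size_t drops = 0;

    assert(irq_pending != NULL);

    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        /* ... the per-CPU stats for this CPU would be flushed here ... */

        /* No boundary after the last CPU; the caller unlocks anyway. */
        if (cpu + 1 == nr_cpus)
            break;

        if (policy == YIELD_ALWAYS ||
            (policy == YIELD_IF_PENDING && irq_pending[cpu]))
            drops++;    /* unlock, let IRQs run, lock again */
    }
    return drops;
}
```

With N CPUs, the unconditional policy always pays N - 1 lock
re-acquisitions per flush, while the conditional one pays only where the
check fires. That difference is the contention concern described above.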

> I already have an upstream devel kernel doing this in my
> testlab, but I need to test this in prod to see the effects.  Can you
> recommend a test I should run in my testlab?

I don't know of any existing test/benchmark. What I used to do is run
a synthetic test with a lot of concurrent reclaim activity (some in
the same cgroups, some in different ones) to stress in-kernel
flushers, and a synthetic test with a lot of concurrent userspace
reads.

I would mainly look at how long the concurrent reclaim operations take
to complete and at the latency histograms of the userspace reads. I
don't have the scripts I used now unfortunately, but I can help with
more details if needed.
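
As a starting point for the userspace-read side, a minimal sketch of the
histogram bookkeeping could look like the following. It assumes log2
buckets in nanoseconds; latency_hist, latency_bucket() and hist_record()
are hypothetical names, not the original scripts. Each reader thread
would time one read of a stat file (for example with clock_gettime())
and pass the elapsed time to hist_record().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HIST_BUCKETS 32

/*
 * Bucket i counts latencies in [2^i, 2^(i+1)) ns. Bucket 0 also holds 0 ns,
 * and the last bucket collects everything at or above 2^31 ns.
 */
struct latency_hist {
    uint64_t count[HIST_BUCKETS];
};

/* Map a latency in nanoseconds to its log2 bucket index. */
static size_t latency_bucket(uint64_t ns)
{
    size_t bucket = 0;

    while (ns > 1 && bucket < HIST_BUCKETS - 1) {
        ns >>= 1;
        bucket++;
    }
    return bucket;
}

/* Account one measured read latency. */
static void hist_record(struct latency_hist *h, uint64_t ns)
{
    assert(h != NULL);
    h->count[latency_bucket(ns)]++;
}
```

Comparing the tail buckets before and after a locking change is
probably more telling than the average, since the problem cases are the
rare long stalls.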

>
> I'm also looking at adding some instrumentation, as my bpftrace
> script[2] needs to be adjusted to every binary build.
> Still hoping ACME will give me an easier approach to measuring lock wait
> and hold time? (without having to instrument *all* locks in the system).
>
>
>   [2]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt
>
>
> >> I vaguely recall experimenting locally with changing that lock into a
> >> mutex and not liking the results, but I can't remember much more. I
> >> could be misremembering though.
> >>
> >> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> >> CPU iterations if rescheduling is needed or the lock is being
> >> contended (i.e. spin_needbreak() returns true). I had always wondered
> >> if it's possible to introduce a similar primitive for IRQs? We could
> >> also drop the lock (and re-enable IRQs) if IRQs are pending then.
> >
> > I am not sure if there is a way to check if a hardirq is pending, but we
> > do have a local_softirq_pending() helper.
>
> The local_softirq_pending() might work well for me, as this is our prod
> problem, that CPU local pending softirq's are getting starved.

If my understanding is correct, softirqs are usually scheduled by
IRQs, which means that local_softirq_pending() may return false if
there are pending IRQs (that will schedule softirqs). Is this correct?

>
> In production another problematic (but rarely occurring issue) is when
> several CPUs contend on this lock.  Yosry's recent work/patches have
> already reduced the chances of this happening (thanks), BUT it still can
> and does happen.
> A simple solution to this, would be to do a spin_trylock() in
> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
> know someone else will do the work.

I am not sure I understand what you mean specifically with the checks
below, but I generally don't like this (as you predicted :) ).

On the memcg side, we used to have similar logic when we used to
always flush the entire tree. This led to flushing being
non-deterministic. You would occasionally get stale stats because of the
contention, which resulted in some inconsistencies (e.g. performing
proactive reclaim successfully then reading the stats that do not
reflect that).
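
A toy single-threaded model of that failure mode, for illustration only:
stat_sim, stat_add(), flush_trylock() and stat_read() are invented
names, and the lock_held flag simply pretends that another flusher owns
the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one stat counter; not kernel code. */
struct stat_sim {
    long pending;     /* updates not yet flushed */
    long visible;     /* value readers see after the last flush */
    bool lock_held;   /* pretend another flusher holds the lock */
};

/* Queue an update; it stays invisible to readers until a flush runs. */
static void stat_add(struct stat_sim *s, long delta)
{
    assert(s != NULL);
    s->pending += delta;
}

/* Trylock-style flush: give up if the lock is "held". True if it ran. */
static bool flush_trylock(struct stat_sim *s)
{
    assert(s != NULL);
    if (s->lock_held)
        return false;
    s->visible += s->pending;
    s->pending = 0;
    return true;
}

/* A reader tries to flush first, then reads whatever is visible. */
static long stat_read(struct stat_sim *s)
{
    (void)flush_trylock(s);
    return s->visible;
}
```

If the update is queued while lock_held is set, stat_read() returns the
old value even though the caller asked for a flush. That is the kind of
inconsistency seen after proactive reclaim.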

Now that we dropped the logic to always flush the entire tree, it is
even more difficult because concurrent flushes could be in completely
unrelated subtrees.

If we were to introduce some smart logic to figure out that the
subtree we are trying to flush is already being flushed, I think we
would need to wait for that ongoing flush to complete instead of just
returning (e.g. using completions). But I think such implementations
to find overlapping flushes and wait for them may be too complicated.
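
For reference, the detection half alone could be sketched as an ancestor
walk. This is only an illustration: cg_node and flush_covers() are
hypothetical, and the sketch ignores how the "ongoing" flush root would
be published and read safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down cgroup node: only the parent link matters. */
struct cg_node {
    const struct cg_node *parent;   /* NULL for the root */
};

/*
 * Return true if a flush rooted at 'ongoing' also covers 'wanted', i.e.
 * 'ongoing' is 'wanted' itself or one of its ancestors.
 */
static bool flush_covers(const struct cg_node *ongoing,
                         const struct cg_node *wanted)
{
    assert(ongoing != NULL && wanted != NULL);

    for (const struct cg_node *cg = wanted; cg != NULL; cg = cg->parent) {
        if (cg == ongoing)
            return true;
    }
    return false;
}
```

Even when this returns true, the caller would still have to wait for the
ongoing flush to finish before reading, and a flush in a sibling subtree
gives it nothing at all.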

> I expect someone to complain here, as cgroup_rstat_flush() takes a
> cgroup argument, so I might starve updates on some other cgroup. I
> wonder if I can simply check if cgroup->rstat_flush_next is not NULL, to
> determine if this cgroup is the one currently being processed?