From: Yosry Ahmed
Date: Tue, 9 Apr 2024 09:45:21 -0700
Subject: Re: Advice on cgroup rstat lock
To: Waiman Long
Cc: Jesper Dangaard Brouer, Johannes Weiner, Tejun Heo,
	Jesper Dangaard Brouer, "David S. Miller", Sebastian Andrzej Siewior,
	Shakeel Butt, Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	kernel-team, cgroups@vger.kernel.org, Linux-MM, Netdev, bpf, LKML,
	Ivan Babrou
In-Reply-To: <96728c6d-3863-48c7-986b-b0b37689849e@redhat.com>
References: <7cd05fac-9d93-45ca-aa15-afd1a34329c6@kernel.org>
	<20240319154437.GA144716@cmpxchg.org>
	<56556042-5269-4c7e-99ed-1a1ab21ac27f@kernel.org>
	<96728c6d-3863-48c7-986b-b0b37689849e@redhat.com>

On Tue, Apr 9, 2024 at 8:37 AM Waiman Long wrote:
>
> On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
> > Let's move this discussion upstream.
> >
> > On 22/03/2024 19.32, Yosry Ahmed wrote:
> >> [..]
> >>>> There were a couple of series that made all calls to
> >>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
> >>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
> >>>> problem that we used to face (except in our case, we saw hard
> >>>> lockups in extreme scenarios):
> >>>> https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
> >>>> https://lore.kernel.org/lkml/20230421174020.2994750-1-yosryahmed@google.com/
> >>>
> >>> I've only done the 6.6 backport, and these were in 6.5/6.6.
> >
> > Given I have these in my 6.6 kernel, you are basically saying I should
> > be able to avoid IRQ-disable for the lock, right?
> >
> > My main problem with the global cgroup_rstat_lock [3] is that it
> > disables IRQs and (thereby also) BH/softirq (spin_lock_irq). This
> > causes production issues elsewhere; e.g. we are seeing network softirq
> > "not-able-to-run" latency issues (debugged via softirq_net_latency.bt
> > [5]).
> >
> > [3] https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
> > [5] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
> >
> >>> And between 6.1 and 6.6 we did observe an improvement in this area.
> >>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
> >>> progresses.)
> >>>
> >>> I've had a chance to get the 6.6 backport running in prod.
> >>> As you can see in the attached grafana heatmap pictures, we do
> >>> observe an improved/reduced softirq wait time.
> >>> These softirq "not-able-to-run" outliers are *one* of the prod
> >>> issues we observed. As you can see, I still have other areas to
> >>> improve/fix.
> >>
> >> I am not very familiar with such heatmaps, but I am glad there is an
> >> improvement with 6.6 and the backports. Let me know if there is
> >> anything I could do to help with your effort.
> >
> > The heatmaps give me an overview, but I needed a debugging tool, so I
> > developed some bpftrace scripts [1][2] that I'm running on production
> > to measure how long we hold the cgroup rstat lock (results below).
> > Adding ACME and Daniel, as I hope there is an easier way to measure
> > lock hold time and contention. Notice the tricky release/yield in
> > cgroup_rstat_flush_locked [4].
> >
> > My production results on 6.6 with backported patches (below
> > signature) vs. our normal 6.6 kernel, with script [2]. The
> > `@lock_time_hist_ns` histogram shows how long the lock was held with
> > IRQs disabled (taking into account that the lock can be released
> > inside the loop [4]).
> >
> > Patched kernel:
> >
> > 21:49:02 time elapsed: 43200 sec
> > @lock_time_hist_ns:
> > [2K, 4K)          61 |                                                    |
> > [4K, 8K)         734 |                                                    |
> > [8K, 16K)     121500 |@@@@@@@@@@@@@@@@                                    |
> > [16K, 32K)    385714 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)    145600 |@@@@@@@@@@@@@@@@@@@                                 |
> > [64K, 128K)   156873 |@@@@@@@@@@@@@@@@@@@@@                               |
> > [128K, 256K)  261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
> > [256K, 512K)  291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> > [512K, 1M)    101859 |@@@@@@@@@@@@@                                       |
> > [1M, 2M)       19866 |@@                                                  |
> > [2M, 4M)       10146 |@                                                   |
> > [4M, 8M)       30633 |@@@@                                                |
> > [8M, 16M)      40365 |@@@@@                                               |
> > [16M, 32M)     21650 |@@                                                  |
> > [32M, 64M)      5842 |                                                    |
> > [64M, 128M)        8 |                                                    |
> >
> > And normal 6.6 kernel:
> >
> > 21:48:32 time elapsed: 43200 sec
> > @lock_time_hist_ns:
> > [1K, 2K)          25 |                                                    |
> > [2K, 4K)        1146 |                                                    |
> > [4K, 8K)       59397 |@@@@                                                |
> > [8K, 16K)     571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> > [16K, 32K)    542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> > [32K, 64K)    202810 |@@@@@@@@@@@@@                                       |
> > [64K, 128K)   134564 |@@@@@@@@@                                           |
> > [128K, 256K)   72870 |@@@@@                                               |
> > [256K, 512K)   56914 |@@@                                                 |
> > [512K, 1M)     83140 |@@@@@                                               |
> > [1M, 2M)      170514 |@@@@@@@@@@@                                         |
> > [2M, 4M)      396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
> > [4M, 8M)      755537 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8M, 16M)     231222 |@@@@@@@@@@@@@@@                                     |
> > [16M, 32M)     76370 |@@@@@                                               |
> > [32M, 64M)      1043 |                                                    |
> > [64M, 128M)       12 |                                                    |
> >
> > For the unpatched kernel we see more events in the 4 ms to 8 ms
> > bucket than in any other bucket.
> > For the patched kernel, we clearly see a significant reduction of
> > events in the 4 ms to 64 ms area, but we still have some events
> > there. I'm very happy to see that these patches improve the
> > situation. But for network processing I'm not happy to see events in
> > the 16 ms to 128 ms area. If we can just avoid disabling IRQs/softirq
> > for the lock, I would be happy.
> >
> > How far can we go... could cgroup_rstat_lock be converted to a mutex?
>
> The cgroup_rstat_lock was originally a mutex. It was converted to a
> spinlock in commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> with a spinlock"). IRQs were disabled to enable calling from atomic
> context.
> Since commit 0a2dc6ac3329 ("cgroup: remove
> cgroup_rstat_flush_atomic()"), the rstat API hasn't been called from
> atomic context anymore. Theoretically, we could change it back to a
> mutex or stop disabling interrupts. That will require that the API
> cannot be called from atomic context going forward.

I think we should avoid flushing from atomic contexts going forward
anyway, tbh. It's just too much work to do with IRQs disabled, and we
have observed hard lockups before in worst-case scenarios.

I think one problem that was discussed before is that flushing is
exercised from multiple contexts and can have very high concurrency
(e.g. from reclaim when the system is under memory pressure). With a
mutex, the flusher could sleep with the mutex held and block other
threads for a while. I vaguely recall experimenting locally with
changing that lock into a mutex and not liking the results, but I can't
remember much more. I could be misremembering though.

Currently, the lock is dropped in cgroup_rstat_flush_locked() between
CPU iterations if rescheduling is needed or the lock is being contended
(i.e. spin_needbreak() returns true). I have always wondered whether
it's possible to introduce a similar primitive for IRQs. We could then
also drop the lock (and re-enable IRQs) if IRQs are pending, along the
lines of the sketch below.
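To make that concrete, here is a minimal sketch against the rough
6.6-era shape of the per-CPU flush loop. irq_needbreak() is a made-up
name for the hypothetical primitive; nothing like it exists today, and
how to cheaply and portably detect a pending IRQ on the local CPU is
exactly the open question:

static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* ... flush the updated rstat tree for this CPU ... */

		/*
		 * Today the loop yields between CPU iterations when
		 * rescheduling is needed or the spinlock is contended.
		 * The hypothetical irq_needbreak() would additionally
		 * back off when the local CPU has an IRQ pending, so
		 * the IRQs-off window stays bounded by one CPU's worth
		 * of flushing.
		 */
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock) ||
		    irq_needbreak()) {
			spin_unlock_irq(&cgroup_rstat_lock);
			if (!cond_resched())
				cpu_relax();
			spin_lock_irq(&cgroup_rstat_lock);
		}
	}
}

This keeps the existing need_resched()/spin_needbreak() yield behavior
intact and only widens the condition, so the worst-case IRQs-off window
shrinks without changing which contexts are allowed to call the flush
API.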