From: Eric Dumazet
Date: Wed, 19 Mar 2025 19:37:53 +0100
Subject: Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
To: Shakeel Butt
Cc: Greg Thelen, Tejun Heo, Johannes Weiner, Michal Koutný, Andrew Morton, Eric Dumazet, Yosry Ahmed, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <54wer7lbgg4mgxv7ky5zzsgjv2vi4diu7clvcklxgmrp2u4gvn@tr2twe5xdtgt>
References: <20250319071330.898763-1-gthelen@google.com> <54wer7lbgg4mgxv7ky5zzsgjv2vi4diu7clvcklxgmrp2u4gvn@tr2twe5xdtgt>
On Wed, Mar 19, 2025 at 6:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet
> >
> > cgroup_rstat_flush_locked() grabs the irq-safe cgroup_rstat_lock while
> > iterating all
> > possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for long enough to drop network packets. On 400+ CPU machines
> > I've seen interrupts disabled for over 40 msec.
>
> Which kernel was this observed on in production?

I had these issues with the latest upstream kernels, and have seen an 80 ms delay.

./trace-cgroup_rstat_flush_locked.sh
Attaching 3 probes...
^C

@cgroup_rstat_flush_locked_us:
[64, 128)              2 |@                                                   |
[128, 256)             3 |@                                                   |
[256, 512)             2 |@                                                   |
[512, 1K)              6 |@@@                                                 |
[1K, 2K)               8 |@@@@@                                               |
[2K, 4K)               3 |@                                                   |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             4 |@@                                                  |
[32K, 64K)            83 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)           74 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |

@max_us: 79214

Greg in his tests was able to get 45 ms "only".

> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.
>
> Doing this for each cpu might be too extreme. Have you tried some
> batching?

Not really; modern platforms do the unlock/lock pair faster than
all the cache-line misses done in the per-cpu loop.

> > This
> > approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though that was in the context of a non-irq rstat spin lock.
> >
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
>
> Though this benchmark seems extreme, userspace holding off irqs
> for that long is bad. BTW, are these memory hoggers creating anon
> memory or file memory? Is [z]swap enabled?
>
> For the long term, I think we can make this work without disabling
> irqs, similar to how networking manages the sock lock.
>
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER, with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >   # => started at: _raw_spin_lock_irq
> >   # => ended at:   cgroup_rstat_flush
> >   #
> >   #
> >   #                  _------=> CPU#
> >   #                 / _-----=> irqs-off/BH-disabled
> >   #                | / _----=> need-resched
> >   #                || / _---=> hardirq/softirq
> >   #                ||| / _--=> preempt-depth
> >   #                |||| / _-=> migrate-disable
> >   #                ||||| /     delay
> >   #  cmd     pid   |||||| time  |   caller
> >   #     \   /      ||||||  \    |    /
> >   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45343us : <stack trace>
> >    => memcg1_stat_format
> >    => memory_stat_format
> >    => memory_stat_show
> >    => seq_read_iter
> >    => vfs_read
> >    => ksys_read
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> >
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from the previous 45341 usec rstat finding.
> >
> > Running the stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > -    with memory hogs: 9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> >
> > The mode of memory.stat access latency, after grouping into power-of-2 buckets:
> > - without memory hogs: 64 usec => 16 usec
> > -    with memory hogs: 64 usec =>  8 usec
> > The memory.stat latency improves.
>
> So things are improving even without batching. I wonder how this will
> look with fewer readers. Can you try with a single reader as well?
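The histogram and latency-mode figures above both group samples into power-of-2 buckets, the scheme bpftrace's log2 histograms use. A small illustrative sketch of that bucketing (not the actual measurement tooling):

```python
from collections import Counter

def pow2_bucket(us: int) -> tuple[int, int]:
    """Return the [lo, hi) power-of-2 bucket containing a latency
    sample (us >= 1), mirroring bpftrace's log2 histogram buckets."""
    lo = 1
    while lo * 2 <= us:
        lo *= 2
    return (lo, lo * 2)

def mode_bucket(samples: list[int]) -> tuple[int, int]:
    """Most common bucket -- the 'mode' latency reported in the thread."""
    return Counter(pow2_bucket(s) for s in samples).most_common(1)[0][0]

# The reported @max_us of 79214 lands in the [64K, 128K) bucket:
assert pow2_bucket(79214) == (64 * 1024, 128 * 1024)
# Samples clustered just above 16 us give a mode bucket of [16, 32):
assert mode_bucket([17, 20, 25, 31, 70, 9]) == (16, 32)
```

So a mode moving from the [64, 128) bucket to [16, 32) or [8, 16), as in the benchmark, means the typical memory.stat read got several powers of two faster.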
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>