From: Rik Theys <rik.theys@gmail.com>
Date: Tue, 1 Apr 2025 22:43:59 +0200
Subject: Re: Memory reclaim and high nfsd usage
To: Mike Snitzer
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Hi,

On Tue, Apr 1, 2025 at 10:07 PM Rik Theys wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these
> > > > periods (which can last for hours), users complain about very
> > > > slow NFS access. We have approx 260 systems connecting to this
> > > > server and the number of NFS client states (from the states
> > > > files in the clients directory) is around 200000.
> > >
> > > Are any of these clients connecting to a server from the same
> > > host?  Only reason I ask is I fixed a recursion deadlock that
> > > manifested in testing when memory was very low and LOCALIO used to
> > > loopback mount on the same host. See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock
> > > via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all, otherwise
> > > your report would indicate breadcrumbs like I mentioned in my
> > > commit, e.g. "task kcompactd0:58 blocked for more than 4435
> > > seconds".)
> >
> > Normally the server does not NFS mount itself. We also don't have
> > any "blocked task" messages reported in dmesg.
> >
> > > > When I look at our monitoring logs, the system has frequent
> > > > direct reclaim stalls (allocstall_movable, and some
> > > > allocstall_normal) and pgscan_kswapd goes up to ~10000000. The
> > > > kswapd_low_wmark_hit_quickly is about 50. So it seems the system
> > > > is out of memory and is constantly trying to free pages? If I
> > > > understand it correctly, the system hits a threshold which makes
> > > > it scan for pages to free, frees some pages, and when it stops
> > > > it very quickly hits the low watermark again?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G.
> > > > Why is the system not dropping more caches to free memory
> > > > instead of constantly looking for memory to free? Is there a
> > > > tunable that we can set so the system will prefer to drop caches
> > > > and increase memory usage for other nfsd related things? Any
> > > > tips on how to debug where the memory pressure is coming from,
> > > > or why the system decides to keep the pages used for cache
> > > > instead of freeing some of those?

Could this be related to
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

  mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT

  commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x kernel, but we're still running 6.12.13
currently.

Regards,
Rik

> >
> > The issue is currently not happening, but I've looked at some of
> > our sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49       0.00   10623.60    0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82       0.00    7756.98    0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88       0.00    8123.21    0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05      11.73    5339.54    0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50       0.00    7892.46    0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57       0.00    7541.08    0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers   kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476  138477148   3949924      1.50  48038428 120797592     13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180  138589324   3974652      1.51  47664880 121277920    157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384  138076684   3969232      1.51  47325688 121184212      4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956  138145972   3883420      1.48  49014976 120205032      5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972  137103196   3852380      1.46  48939020 119461684    306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116  136567640   3891224      1.48  49335740 118841092      3412
> >
> > # sar -W
> > 04:00:00 PM  pswpin/s pswpout/s
> > 04:00:00 PM      0.09      0.29
> > 04:10:03 PM      0.33      0.60
> > 04:20:05 PM      0.20      0.38
> > 04:30:03 PM      0.69      0.33
> > 04:40:05 PM      0.36      0.72
> > 04:50:02 PM      0.30      0.46
> >
> > If I read this correctly, the system is scanning for free pages
> > (pgscank) and freeing some of them (pgfree), but the efficiency is
> > low (%vmeff). At the same time, the amount of memory used
> > (kbmemused / %memused) is quite low as most of the memory is used
> > as cache. There's approx 120G of inactive memory. So I'm at a loss
> > as to why the system is performing these page scans and stalling
> > instead of dropping some of the cache and using that instead.
> >
> > >
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage
> > > 6.14's "DONTCACHE" support (particularly in nfsd, but the client
> > > might benefit some too) to avoid nfsd writeback stalls due to
> > > memory being fragmented and reclaim having to work too hard (in
> > > concert with kcompactd) to find adequate pages.
> > >
> > > > I've run a perf record for 10s and the top 4 of the events seem
> > > > to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then
> > > use the perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download the FlameGraph project:
> > >   git clone https://github.com/brendangregg/FlameGraph
> > >   (you will likely need to install some missing deps, e.g.:
> > >   yum install perl-open.noarch)
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > >   (this will generate a perf.data output file)
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > >
> > >   perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > >   $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web
> > > browser works well).
> > >
> > > I just find a flamegraph way more useful than 'perf report'
> > > ranked ordering.
> >
> > That's a very good idea, thanks. I will try that when the issue
> > returns.
>
> The kswapd process started to consume some CPU again, so I've
> followed this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> > > > Are there any known memory management changes related to NFS
> > > > that have been introduced that could explain this behavior?
> > > > What steps can I take to debug the root cause of this? Looking
> > > > at iftop there isn't much going on regarding throughput. The
> > > > top 3 NFSv4 server operations are sequence (9563/s), putfh
> > > > (9032/s) and getattr (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too
> > > (now cc'd).
> >
> > Thanks. All ideas on how I can determine the root cause of this are
> > appreciated.
> >
> > Regards,
> > Rik
> >
> > --
> > Rik

-- 
Rik
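As a sanity check on the sar -B figures quoted above: %vmeff is pgsteal
divided by the total pages scanned (pgscank + pgscand). Plugging in the
04:00:00 PM sample (values copied from the output above; this is just a
one-off awk sketch, not part of any tooling mentioned in the thread):

```shell
# Reclaim efficiency for the 04:00:00 PM `sar -B` sample:
# pages stolen per 100 pages scanned (pgscank/s + pgscand/s).
pgscank=10817339.49
pgscand=0.00
pgsteal=10623.60
awk -v k="$pgscank" -v d="$pgscand" -v s="$pgsteal" \
    'BEGIN { printf "%%vmeff = %.2f\n", 100 * s / (k + d) }'
# prints: %vmeff = 0.10
```

That matches the 0.10 sar reports for the interval, i.e. roughly one
page reclaimed per thousand pages scanned, consistent with the
low-efficiency reading above.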