Date: Tue, 1 Apr 2025 13:40:10 -0400
From: Mike Snitzer <snitzer@kernel.org>
To: Rik Theys
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory reclaim and high nfsd usage

On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> Hi,
>
> Our fileserver is currently running 6.12.13 with the following 3
> patches (from nfsd-testing) applied to it:
>
> - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> - skip-sending-CB_RECALL_ANY
> - fix-cb_getattr_status-fix
>
> Frequently the load on the system goes up and top shows a lot of
> kswapd and kcompactd threads next to nfsd threads. During these
> periods (which can last for hours), users complain about very slow
> NFS access. We have approx 260 systems connecting to this server and
> the number of NFS client states (from the states files in the clients
> directory) is around 200000.

Are any of these clients connecting to a server from the same host?
The only reason I ask is that I fixed a recursion deadlock that
manifested in testing when memory was very low and LOCALIO was used to
loopback mount on the same host.  See:

ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
https://git.kernel.org/linus/ce6d9c1c2b5cc785

(I suspect you aren't using NFS loopback mounts at all, otherwise your
report would include breadcrumbs like those mentioned in my commit,
e.g. "task kcompactd0:58 blocked for more than 4435 seconds".)
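If you want a quick way to rule that case out, grepping the kernel log
for those hung-task breadcrumbs should be enough.  Rough sketch only
(assuming your dmesg/journalctl retention still covers one of the slow
periods):

  # look for hung-task detector breadcrumbs like the one quoted above
  dmesg -T | grep -i 'blocked for more than'
  # or, if the dmesg ring buffer has already rotated:
  journalctl -k | grep -i 'blocked for more than'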
> When I look at our monitoring logs, the system has frequent direct
> reclaim stalls (allocstall_movable, and some allocstall_normal) and
> pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> is about 50. So it seems the system is out of memory and is constantly
> trying to free pages? If I understand it correctly, the system hits a
> threshold which makes it scan for pages to free, frees some pages, and
> when it stops it very quickly hits the low watermark again?
>
> But the system has over 150G of memory dedicated to cache, and
> slab_reclaim is only about 16G. Why is the system not dropping more
> caches to free memory instead of constantly looking for memory to
> free? Is there a tunable that we can set so the system will prefer to
> drop caches and increase memory usage for other nfsd-related things?
> Any tips on how to debug where the memory pressure is coming from, or
> why the system decides to keep the pages used for cache instead of
> freeing some of those?

All good questions, to which I don't have immediate answers (but others
may).  Just FYI: there is a slow-start development TODO to leverage
6.14's "DONTCACHE" support (particularly in nfsd, but the client might
benefit some too) to avoid nfsd writeback stalls due to memory being
fragmented and reclaim having to work too hard (in concert with
kcompactd) to find adequate pages.

> I've run a perf record for 10s and the top 4 events seem to be:
>
> 1. 54% is swapper in intel_idle_ibrs
> 2. 12% is swapper in intel_idle
> 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> 4. 5% is kswapd0 in __list_del_entry_valid_or_report

10s is pretty short... you might consider a longer sample and then use
the perf.data to generate a flamegraph, e.g.:

- Download the FlameGraph project:
  git clone https://github.com/brendangregg/FlameGraph
  (you will likely need to install some missing deps, e.g.:
  yum install perl-open.noarch)

- export FLAME=/root/git/FlameGraph

- perf record -F 99 -a -g sleep 120
  (this will generate a perf.data output file)

Once you have perf.data output, generate a flamegraph file (named
perf.svg) using these 2 commands:

  perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
  $FLAME/flamegraph.pl out.perf-folded > perf.svg

Open the perf.svg image with your favorite image viewer (a web browser
works well).  I just find a flamegraph way more useful than
'perf report' ranked ordering.

> Are there any known memory management changes related to NFS that
> have been introduced that could explain this behavior? What steps can
> I take to debug the root cause of this? Looking at iftop there isn't
> much going on regarding throughput. The top 3 NFS4 server operations
> are sequence (9563/s), putfh (9032/s) and getattr (7150/s).

You'd likely do well to expand the audience to include MM too (now
cc'd).
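One other cheap thing you can do while a longer perf capture runs is to
watch the reclaim counters you listed straight from /proc/vmstat, so
you can see whether the stalls line up with the periods of slow NFS
access.  Rough sketch only (plain shell loop; the counter names are the
ones from your report, nothing NFS-specific):

  # print a timestamped snapshot of the reclaim counters every 10s
  while true; do
      date
      grep -E 'allocstall|pgscan|wmark_hit_quickly' /proc/vmstat
      sleep 10
  done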