Date: Tue, 1 Apr 2025 13:40:10 -0400
From: Mike Snitzer <snitzer@kernel.org>
To: Rik Theys
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory reclaim and high nfsd usage

On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> Hi,
>
> Our fileserver is currently running 6.12.13 with the following 3
> patches (from nfsd-testing) applied to it:
>
> - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> - skip-sending-CB_RECALL_ANY
> - fix-cb_getattr_status-fix
>
> Frequently the load on the system goes up and top shows a lot of
> kswapd and kcompactd threads next to nfsd threads. During these
> periods (which can last for hours), users complain about very slow
> NFS access. We have approx 260 systems connecting to this server and
> the number of NFS client states (from the states files in the clients
> directory) is around 200000.

Are any of these clients connecting to a server from the same host?
The only reason I ask is that I fixed a recursion deadlock that
manifested in testing when memory was very low and LOCALIO was used to
loopback mount on the same host.  See:

ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
https://git.kernel.org/linus/ce6d9c1c2b5cc785

(I suspect you aren't using NFS loopback mounts at all, otherwise your
report would include breadcrumbs like those mentioned in my commit,
e.g. "task kcompactd0:58 blocked for more than 4435 seconds".)
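If you want a quick way to rule that case out, grepping the kernel log
for those hung-task breadcrumbs should be enough.  Rough sketch only
(assuming your dmesg/journalctl retention still covers one of the slow
periods):

  # look for hung-task detector breadcrumbs like the one quoted above
  dmesg -T | grep -i 'blocked for more than'
  # or, if the dmesg ring buffer has already rotated:
  journalctl -k | grep -i 'blocked for more than'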
> When I look at our monitoring logs, the system has frequent direct
> reclaim stalls (allocstall_movable, and some allocstall_normal) and
> pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> is about 50. So it seems the system is out of memory and is constantly
> trying to free pages? If I understand it correctly, the system hits a
> threshold which makes it scan for pages to free, frees some pages, and
> when it stops it very quickly hits the low watermark again?
>
> But the system has over 150G of memory dedicated to cache, and
> slab_reclaim is only about 16G. Why is the system not dropping more
> caches to free memory instead of constantly looking for memory to
> free? Is there a tunable that we can set so the system will prefer to
> drop caches and increase memory usage for other nfsd-related things?
> Any tips on how to debug where the memory pressure is coming from, or
> why the system decides to keep the pages used for cache instead of
> freeing some of those?

All good questions, to which I don't have immediate answers (but others
may).  Just FYI: there is a slow-start development TODO to leverage
6.14's "DONTCACHE" support (particularly in nfsd, but the client might
benefit some too) to avoid nfsd writeback stalls due to memory being
fragmented and reclaim having to work too hard (in concert with
kcompactd) to find adequate pages.

> I've run a perf record for 10s and the top 4 events seem to be:
>
> 1. 54% is swapper in intel_idle_ibrs
> 2. 12% is swapper in intel_idle
> 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> 4. 5% is kswapd0 in __list_del_entry_valid_or_report

10s is pretty short... you might consider a longer sample and then use
the perf.data to generate a flamegraph, e.g.:

- Download the FlameGraph project:
  git clone https://github.com/brendangregg/FlameGraph
  (you will likely need to install some missing deps, e.g.:
  yum install perl-open.noarch)

- export FLAME=/root/git/FlameGraph

- perf record -F 99 -a -g sleep 120
  (this will generate a perf.data output file)

Once you have perf.data output, generate a flamegraph file (named
perf.svg) using these 2 commands:

  perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
  $FLAME/flamegraph.pl out.perf-folded > perf.svg

Open the perf.svg image with your favorite image viewer (a web browser
works well).  I just find a flamegraph way more useful than
'perf report' ranked ordering.

> Are there any known memory management changes related to NFS that
> have been introduced that could explain this behavior? What steps can
> I take to debug the root cause of this? Looking at iftop there isn't
> much going on regarding throughput. The top 3 NFS4 server operations
> are sequence (9563/s), putfh (9032/s) and getattr (7150/s).

You'd likely do well to expand the audience to include MM too (now
cc'd).
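One other cheap thing you can do while a longer perf capture runs is to
watch the reclaim counters you listed straight from /proc/vmstat, so
you can see whether the stalls line up with the periods of slow NFS
access.  Rough sketch only (plain shell loop; the counter names are the
ones from your report, nothing NFS-specific):

  # print a timestamped snapshot of the reclaim counters every 10s
  while true; do
      date
      grep -E 'allocstall|pgscan|wmark_hit_quickly' /proc/vmstat
      sleep 10
  done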