From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAPwv0JktC7Kb4cibSbioNAAZ9FeWs6aHeLRXDk_6MKUik1j3mg@mail.gmail.com>
 <Z-wk-sJXi0dzttM_@kernel.org>
In-Reply-To: <Z-wk-sJXi0dzttM_@kernel.org>
From: Rik Theys <rik.theys@gmail.com>
Date: Tue, 1 Apr 2025 21:31:05 +0200
X-Gm-Features: AQ5f1JpLdh8BUINVWTfSTBwhslY81BBFxE2PpAZx5aNA-2W0brjNjXCT2tpVPrI
Message-ID: <CAPwv0J=m9N6oBy+_Y-cVeBaT5_feqsc36+sb=ECrXezFXO68wA@mail.gmail.com>
Subject: Re: Memory reclaim and high nfsd usage
To: Mike Snitzer <snitzer@kernel.org>
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Hi,

On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > Hi,
> >
> > Our fileserver is currently running 6.12.13 with the following 3
> > patches (from nfsd-testing) applied to it:
> >
> > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > - skip-sending-CB_RECALL_ANY
> > - fix-cb_getattr_status-fix
> >
> > Frequently the load on the system goes up and top shows a lot of
> > kswapd and kcompact threads next to nfsd threads. During these periods
> > (which can last for hours), users complain about very slow NFS access.
> > We have approx 260 systems connecting to this server and the number of
> > nfs client states (from the states files in the clients directory) is
> > around 200000.
>
> Are any of these clients connecting to a server from the same host?
> Only reason I ask is I fixed a recursion deadlock that manifested in
> testing when memory was very low and LOCALIO used to loopback mount on
> the same host.  See:
>
> ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> https://git.kernel.org/linus/ce6d9c1c2b5cc785
>
> (I suspect you aren't using NFS loopback mounts at all otherwise your
> report would indicate breadcrumbs like I mentioned in my commit,
> e.g. "task kcompactd0:58 blocked for more than 4435 seconds").

Normally the server does not NFS mount itself. We also don't have any
"blocked task" messages reported in dmesg.

>
> > When I look at our monitoring logs, the system has frequent direct
> > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > is about 50. So it seems the system is out of memory and is constantly
> > trying to free pages? If I understand it correctly the system hits a
> > threshold which makes it scan for pages to free, frees some pages and
> > when it stops it very quickly hits the low watermark again?
> >
> > But the system has over 150G of memory dedicated to cache, and
> > slab_reclaim is only about 16G. Why is the system not dropping more
> > caches to free memory instead of constantly looking to free memory? Is
> > there a tunable that we can set so the system will prefer to drop
> > caches and increase memory usage for other nfsd related things? Any
> > tips on how to debug where the memory pressure is coming from, or why
> > the system decides to keep the pages used for cache instead of freeing
> > some of those?

The issue is currently not happening, but I've looked at some of our
sar statistics from today:

# sar -B
04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s   pgscank/s pgscand/s pgsteal/s    %vmeff
04:00:00 PM   6570.43  37504.61   1937.60      0.20 337274.24 10817339.49      0.00  10623.60      0.10
04:10:03 PM   6266.09  28821.33   4392.91      0.65 266336.28  8464619.82      0.00   7756.98      0.09
04:20:05 PM   6894.44  33790.76  12713.86      1.86 271167.36  9689653.88      0.00   8123.21      0.08
04:30:03 PM   6839.52  24451.70   1693.22      0.76 237536.27  9268350.05     11.73   5339.54      0.06
04:40:05 PM   6197.73  28958.02   4260.95      0.33 306245.10  9797882.50      0.00   7892.46      0.08
04:50:02 PM   4252.11  31658.28   1849.64      0.58 297727.92  6885422.57      0.00   7541.08      0.11

# sar -r
04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
04:00:00 PM   3942896 180501232   2652336      1.35  29594476 138477148   3949924      1.50  48038428 120797592     13324
04:10:03 PM   4062416 180601484   2564852      1.31  29574180 138589324   3974652      1.51  47664880 121277920    157472
04:20:05 PM   4131172 180150888   3013128      1.54  29669384 138076684   3969232      1.51  47325688 121184212      4448
04:30:03 PM   4112388 180835756   2344936      1.20  30338956 138145972   3883420      1.48  49014976 120205032      5072
04:40:05 PM   3892332 179390408   3428992      1.75  30559972 137103196   3852380      1.46  48939020 119461684    306336
04:50:02 PM   4328220 180002072   3197120      1.63  30873116 136567640   3891224      1.48  49335740 118841092      3412

# sar -W
04:00:00 PM  pswpin/s pswpout/s
04:00:00 PM      0.09      0.29
04:10:03 PM      0.33      0.60
04:20:05 PM      0.20      0.38
04:30:03 PM      0.69      0.33
04:40:05 PM      0.36      0.72
04:50:02 PM      0.30      0.46

If I read this correctly, the system is scanning for pages to reclaim
(pgscank, so almost all of it is done by kswapd rather than by direct
reclaim) and reclaiming only a few of them (pgsteal), so the efficiency
is very low (%vmeff).
At the same time, the amount of memory used (kbmemused / %memused) is
quite low, as most of the memory is used as cache. There's approx 120G
of inactive memory.
So I'm at a loss as to why the system is performing these page scans and
stalling instead of dropping some of the cache and using that instead.
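
As a rough sanity check of the %vmeff column, here is a small sketch that
recomputes it from the sar -B rows above. It assumes %vmeff is simply
pgsteal divided by the total pages scanned (pgscank + pgscand), which is my
reading of the sar man page; the vmeff() helper and the SAMPLES list are
made up for illustration only.

```python
# (time, pgscank/s, pgscand/s, pgsteal/s) copied from the sar -B output above.
SAMPLES = [
    ("04:00:00 PM", 10817339.49, 0.00, 10623.60),
    ("04:10:03 PM", 8464619.82, 0.00, 7756.98),
    ("04:20:05 PM", 9689653.88, 0.00, 8123.21),
    ("04:30:03 PM", 9268350.05, 11.73, 5339.54),
    ("04:40:05 PM", 9797882.50, 0.00, 7892.46),
    ("04:50:02 PM", 6885422.57, 0.00, 7541.08),
]


def vmeff(pgscank, pgscand, pgsteal):
    """Percentage of scanned pages that were reclaimed (assumed formula).

    Returns 0.0 when nothing was scanned, to avoid dividing by zero.
    """
    scanned = pgscank + pgscand
    return 100.0 * pgsteal / scanned if scanned else 0.0


for when, kswapd_scan, direct_scan, steal in SAMPLES:
    pct = vmeff(kswapd_scan, direct_scan, steal)
    print(f"{when}  scanned/s={kswapd_scan + direct_scan:14.2f}  "
          f"stolen/s={steal:9.2f}  vmeff={pct:.2f}%")
```

With these numbers, roughly one page is reclaimed for every thousand
scanned, which is why the reclaim looks so unproductive to me.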

>
> All good questions, to which I don't have immediate answers (but
> others may).
>
> Just FYI: there is a slow-start development TODO to leverage 6.14's
> "DONTCACHE" support (particularly in nfsd, but client might benefit
> some too) to avoid nfsd writeback stalls due to memory being
> fragmented and reclaim having to work too hard (in concert with
> kcompactd) to find adequate pages.
>
> > I've run a perf record for 10s and the top 4 events seem to be:
> >
> > 1. 54% is swapper in intel_idle_ibrs
> > 2. 12% is swapper in intel_idle
> > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
>
> 10s is pretty short... might consider a longer sample and then use the
> perf.data to generate a flamegraph, e.g.:
>
> - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
>   you will likely need to install some missing deps, e.g.:
>   yum install perl-open.noarch
> - export FLAME=/root/git/FlameGraph
> - perf record -F 99 -a -g sleep 120
>   - this will generate a perf.data output file.
>
> Once you have perf.data output, generate a flamegraph file (named
> perf.svg) using these 2 commands:
> perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> $FLAME/flamegraph.pl out.perf-folded > perf.svg
>
> Open the perf.svg image with your favorite image viewer (a web browser
> works well).
>
> I just find flamegraph way more useful than 'perf report' ranked
> ordering.

That's a very good idea, thanks. I will try that when the issue returns.

>
> > Are there any known memory management changes related to NFS that have
> > been introduced that could explain this behavior? What steps can I
> > take to debug the root cause of this? Looking at iftop there isn't
> > much going on regarding throughput. The top 3 NFS4 server operations
> > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
>
> You'd likely do well to expand the audience to include MM too (now cc'd).

Thanks. Any ideas on how I can determine the root cause of this are
appreciated.


Regards,
Rik