From: Rik Theys
Date: Tue, 1 Apr 2025 21:31:05 +0200
Subject: Re: Memory reclaim and high nfsd usage
To: Mike Snitzer
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org

Hi,

On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer wrote:
>
> On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > Hi,
> >
> > Our fileserver is currently running 6.12.13 with the following 3
> > patches (from nfsd-testing) applied to it:
> >
> > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > - skip-sending-CB_RECALL_ANY
> > - fix-cb_getattr_status-fix
> >
> > Frequently the load on the system goes up and top shows a lot of
> > kswapd and kcompactd threads next to nfsd threads. During these periods
> > (which can last for hours), users complain about very slow NFS access.
> > We have approx 260 systems connecting to this server and the number of
> > nfs client states (from the states files in the clients directory) is
> > around 200000.
>
> Are any of these clients connecting to a server from the same host?
> Only reason I ask is I fixed a recursion deadlock that manifested in
> testing when memory was very low and LOCALIO used to loopback mount on
> the same host. See:
>
> ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> https://git.kernel.org/linus/ce6d9c1c2b5cc785
>
> (I suspect you aren't using NFS loopback mounts at all otherwise your
> report would indicate breadcrumbs like I mentioned in my commit,
> e.g. "task kcompactd0:58 blocked for more than 4435 seconds").

Normally the server does not NFS mount itself. We also don't have any
"blocked task" messages reported in dmesg.

>
> > When I look at our monitoring logs, the system has frequent direct
> > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > is about 50. So it seems the system is out of memory and is constantly
> > trying to free pages? If I understand it correctly the system hits a
> > threshold which makes it scan for pages to free, frees some pages and
> > when it stops it very quickly hits the low watermark again?
> >
> > But the system has over 150G of memory dedicated to cache, and
> > slab_reclaim is only about 16G. Why is the system not dropping more
> > caches to free memory instead of constantly looking to free memory? Is
> > there a tunable that we can set so the system will prefer to drop
> > caches and increase memory usage for other nfsd related things? Any
> > tips on how to debug where the memory pressure is coming from, or why
> > the system decides to keep the pages used for cache instead of freeing
> > some of those?
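
The counters quoted above (allocstall_*, pgscan_kswapd,
kswapd_low_wmark_hit_quickly) are cumulative values in /proc/vmstat, so
what matters is how fast they grow. A minimal sketch, assuming Python 3
on the server, of turning two snapshots into per-interval deltas. The
helper names (parse_vmstat, vmstat_delta, read_vmstat) are made up for
this example and are not part of any existing tool:

```python
# Sketch: turn two /proc/vmstat snapshots into per-interval deltas.
# The counter names below are the ones discussed in this thread plus their
# direct-reclaim/steal counterparts; the helper names are hypothetical.

WATCHED = (
    "allocstall_normal",
    "allocstall_movable",
    "pgscan_kswapd",
    "pgscan_direct",
    "pgsteal_kswapd",
    "pgsteal_direct",
    "kswapd_low_wmark_hit_quickly",
)


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_delta(before, after, names=WATCHED):
    """Return after - before for each watched counter present in both."""
    return {n: after[n] - before[n] for n in names if n in before and n in after}


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Calling read_vmstat() twice, some seconds apart, and passing both
results to vmstat_delta() gives the growth over that interval; dividing
by the interval length gives rates comparable to what sar reports.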

The issue is currently not happening, but I've looked at some of our
sar statistics from today:

# sar -B
04:00:00 PM  pgpgin/s  pgpgout/s   fault/s  majflt/s   pgfree/s    pgscank/s  pgscand/s  pgsteal/s  %vmeff
04:00:00 PM   6570.43   37504.61   1937.60      0.20  337274.24  10817339.49       0.00   10623.60    0.10
04:10:03 PM   6266.09   28821.33   4392.91      0.65  266336.28   8464619.82       0.00    7756.98    0.09
04:20:05 PM   6894.44   33790.76  12713.86      1.86  271167.36   9689653.88       0.00    8123.21    0.08
04:30:03 PM   6839.52   24451.70   1693.22      0.76  237536.27   9268350.05      11.73    5339.54    0.06
04:40:05 PM   6197.73   28958.02   4260.95      0.33  306245.10   9797882.50       0.00    7892.46    0.08
04:50:02 PM   4252.11   31658.28   1849.64      0.58  297727.92   6885422.57       0.00    7541.08    0.11

# sar -r
04:00:00 PM  kbmemfree    kbavail  kbmemused  %memused  kbbuffers   kbcached  kbcommit  %commit  kbactive    kbinact  kbdirty
04:00:00 PM    3942896  180501232    2652336      1.35   29594476  138477148   3949924     1.50  48038428  120797592    13324
04:10:03 PM    4062416  180601484    2564852      1.31   29574180  138589324   3974652     1.51  47664880  121277920   157472
04:20:05 PM    4131172  180150888    3013128      1.54   29669384  138076684   3969232     1.51  47325688  121184212     4448
04:30:03 PM    4112388  180835756    2344936      1.20   30338956  138145972   3883420     1.48  49014976  120205032     5072
04:40:05 PM    3892332  179390408    3428992      1.75   30559972  137103196   3852380     1.46  48939020  119461684   306336
04:50:02 PM    4328220  180002072    3197120      1.63   30873116  136567640   3891224     1.48  49335740  118841092     3412

# sar -W
04:00:00 PM  pswpin/s  pswpout/s
04:00:00 PM      0.09       0.29
04:10:03 PM      0.33       0.60
04:20:05 PM      0.20       0.38
04:30:03 PM      0.69       0.33
04:40:05 PM      0.36       0.72
04:50:02 PM      0.30       0.46

If I read this correctly, the system is scanning for free pages
(pgscank; pgscand is near zero) and freeing some of them (pgfree), but
the efficiency is low (%vmeff). At the same time, the amount of memory
used (kbmemused / %memused) is quite low as most of the memory is used
as cache. There's approx 120G of inactive memory.
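
As a sanity check on the %vmeff column: the sar(1) man page describes
it as pgsteal divided by pgscan, i.e. pages reclaimed per page scanned.
A minimal sketch, assuming that definition and summing the kswapd and
direct scan rates, recomputed for two of the sar -B rows above. The
function name vmeff is made up for this example:

```python
# Sketch: recompute sar's %vmeff from the per-second rates in a sar -B row.
# Assumption: %vmeff = pgsteal / (pgscank + pgscand) * 100, following the
# description in the sar(1) man page. The name vmeff is hypothetical.

def vmeff(pgsteal, pgscank, pgscand):
    """Return reclaim efficiency in percent, or 0.0 if nothing was scanned."""
    scanned = pgscank + pgscand
    if scanned == 0:
        return 0.0
    return 100.0 * pgsteal / scanned


# (time, pgscank/s, pgscand/s, pgsteal/s) copied from the sar -B output
rows = [
    ("04:00:00 PM", 10817339.49, 0.00, 10623.60),
    ("04:30:03 PM", 9268350.05, 11.73, 5339.54),
]

for when, scank, scand, steal in rows:
    print(f"{when}  %vmeff = {vmeff(steal, scank, scand):.2f}")
```

This prints 0.10 and 0.06, matching the sar columns: roughly one page
reclaimed for every thousand or more pages scanned.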

So I'm at a loss as to why the system is performing these page scans
and stalling rather than dropping some of the cache and using that
memory.

>
> All good questions, to which I don't have immediate answers (but
> others may).
>
> Just FYI: there is a slow-start development TODO to leverage 6.14's
> "DONTCACHE" support (particularly in nfsd, but client might benefit
> some too) to avoid nfsd writeback stalls due to memory being
> fragmented and reclaim having to work too hard (in concert with
> kcompactd) to find adequate pages.
>
> > I've run a perf record for 10s and the top 4 of the events seem to be:
> >
> > 1. 54% is swapper in intel_idle_ibrs
> > 2. 12% is swapper in intel_idle
> > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
>
> 10s is pretty short... might consider a longer sample and then use the
> perf.data to generate a flamegraph, e.g.:
>
> - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
>   you will likely need to install some missing deps, e.g.:
>   yum install perl-open.noarch
> - export FLAME=/root/git/FlameGraph
> - perf record -F 99 -a -g sleep 120
>   - this will generate a perf.data output file.
>
> Once you have perf.data output, generate a flamegraph file (named
> perf.svg) using these 2 commands:
> perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> $FLAME/flamegraph.pl out.perf-folded > perf.svg
>
> Open the perf.svg image with your favorite image viewer (a web browser
> works well).
>
> I just find flamegraph way more useful than 'perf report' ranked
> ordering.

That's a very good idea, thanks. I will try that when the issue returns.

>
> > Are there any known memory management changes related to NFS that have
> > been introduced that could explain this behavior? What steps can I
> > take to debug the root cause of this? Looking at iftop there isn't
> > much going on regarding throughput. The top 3 NFS4 server operations
> > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
>
> You'd likely do well to expand the audience to include MM too (now cc'd).

Thanks. All ideas on how I can determine the root cause of this are
appreciated.

Regards,
Rik