From: Rik Theys <rik.theys@gmail.com>
Date: Tue, 1 Apr 2025 22:43:59 +0200
Subject: Re: Memory reclaim and high nfsd usage
To: Mike Snitzer
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Hi,

On Tue, Apr 1, 2025 at 10:07 PM Rik Theys wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these
> > > > periods (which can last for hours), users complain about very
> > > > slow NFS access. We have approx 260 systems connecting to this
> > > > server and the number of NFS client states (from the states
> > > > files in the clients directory) is around 200000.
> > >
> > > Are any of these clients connecting to a server from the same
> > > host?  Only reason I ask is I fixed a recursion deadlock that
> > > manifested in testing when memory was very low and LOCALIO used to
> > > loopback mount on the same host. See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock
> > > via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all, otherwise
> > > your report would indicate breadcrumbs like I mentioned in my
> > > commit, e.g. "task kcompactd0:58 blocked for more than 4435
> > > seconds".)
> >
> > Normally the server does not NFS mount itself. We also don't have
> > any "blocked task" messages reported in dmesg.
> >
> > > > When I look at our monitoring logs, the system has frequent
> > > > direct reclaim stalls (allocstall_movable, and some
> > > > allocstall_normal) and pgscan_kswapd goes up to ~10000000. The
> > > > kswapd_low_wmark_hit_quickly is about 50. So it seems the system
> > > > is out of memory and is constantly trying to free pages? If I
> > > > understand it correctly, the system hits a threshold which makes
> > > > it scan for pages to free, frees some pages, and when it stops
> > > > it very quickly hits the low watermark again?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G.
> > > > Why is the system not dropping more caches to free memory
> > > > instead of constantly looking for memory to free? Is there a
> > > > tunable that we can set so the system will prefer to drop caches
> > > > and increase memory usage for other nfsd related things? Any
> > > > tips on how to debug where the memory pressure is coming from,
> > > > or why the system decides to keep the pages used for cache
> > > > instead of freeing some of those?

Could this be related to
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

  mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT

  commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x kernel, but we're still running 6.12.13
currently.

Regards,
Rik

> >
> > The issue is currently not happening, but I've looked at some of
> > our sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49       0.00   10623.60    0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82       0.00    7756.98    0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88       0.00    8123.21    0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05      11.73    5339.54    0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50       0.00    7892.46    0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57       0.00    7541.08    0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers   kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476  138477148   3949924      1.50  48038428 120797592     13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180  138589324   3974652      1.51  47664880 121277920    157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384  138076684   3969232      1.51  47325688 121184212      4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956  138145972   3883420      1.48  49014976 120205032      5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972  137103196   3852380      1.46  48939020 119461684    306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116  136567640   3891224      1.48  49335740 118841092      3412
> >
> > # sar -W
> > 04:00:00 PM  pswpin/s pswpout/s
> > 04:00:00 PM      0.09      0.29
> > 04:10:03 PM      0.33      0.60
> > 04:20:05 PM      0.20      0.38
> > 04:30:03 PM      0.69      0.33
> > 04:40:05 PM      0.36      0.72
> > 04:50:02 PM      0.30      0.46
> >
> > If I read this correctly, the system is scanning for free pages
> > (pgscank) and freeing some of them (pgfree), but the efficiency is
> > low (%vmeff). At the same time, the amount of memory used
> > (kbmemused / %memused) is quite low as most of the memory is used
> > as cache. There's approx 120G of inactive memory. So I'm at a loss
> > as to why the system is performing these page scans and stalling
> > instead of dropping some of the cache and using that instead.
> >
> > >
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage
> > > 6.14's "DONTCACHE" support (particularly in nfsd, but the client
> > > might benefit some too) to avoid nfsd writeback stalls due to
> > > memory being fragmented and reclaim having to work too hard (in
> > > concert with kcompactd) to find adequate pages.
> > >
> > > > I've run a perf record for 10s and the top 4 of the events seem
> > > > to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then
> > > use the perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download the FlameGraph project:
> > >   git clone https://github.com/brendangregg/FlameGraph
> > >   (you will likely need to install some missing deps, e.g.:
> > >   yum install perl-open.noarch)
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > >   (this will generate a perf.data output file)
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > >
> > >   perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > >   $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web
> > > browser works well).
> > >
> > > I just find a flamegraph way more useful than 'perf report'
> > > ranked ordering.
> >
> > That's a very good idea, thanks. I will try that when the issue
> > returns.
>
> The kswapd process started to consume some CPU again, so I've
> followed this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> > > > Are there any known memory management changes related to NFS
> > > > that have been introduced that could explain this behavior?
> > > > What steps can I take to debug the root cause of this? Looking
> > > > at iftop there isn't much going on regarding throughput. The
> > > > top 3 NFSv4 server operations are sequence (9563/s), putfh
> > > > (9032/s) and getattr (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too
> > > (now cc'd).
> >
> > Thanks. All ideas on how I can determine the root cause of this are
> > appreciated.
> >
> > Regards,
> > Rik
> >
> > --
> > Rik

-- 
Rik
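As a sanity check on the sar -B figures quoted above: %vmeff is pgsteal
divided by the total pages scanned (pgscank + pgscand). Plugging in the
04:00:00 PM sample (values copied from the output above; this is just a
one-off awk sketch, not part of any tooling mentioned in the thread):

```shell
# Reclaim efficiency for the 04:00:00 PM `sar -B` sample:
# pages stolen per 100 pages scanned (pgscank/s + pgscand/s).
pgscank=10817339.49
pgscand=0.00
pgsteal=10623.60
awk -v k="$pgscank" -v d="$pgscand" -v s="$pgsteal" \
    'BEGIN { printf "%%vmeff = %.2f\n", 100 * s / (k + d) }'
# prints: %vmeff = 0.10
```

That matches the 0.10 sar reports for the interval, i.e. roughly one
page reclaimed per thousand pages scanned, consistent with the
low-efficiency reading above.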