From: Rik Theys <rik.theys@gmail.com>
To: Mike Snitzer <snitzer@kernel.org>
Cc: linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory reclaim and high nfsd usage
Date: Tue, 1 Apr 2025 22:43:59 +0200
Message-ID: <CAPwv0JmRGT9b4LeubKABWOsc97U0i6_kJyMAJQ2K7qoexSB=zA@mail.gmail.com>
In-Reply-To: <CAPwv0J=rYJJEzrXOcVtMVDc7xNeEbSPdWVis_nAGinPc=fd6ng@mail.gmail.com>
Hi,
On Tue, Apr 1, 2025 at 10:07 PM Rik Theys <rik.theys@gmail.com> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@kernel.org> wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these periods
> > > > (which can last for hours), users complain about very slow NFS access.
> > > > We have approx 260 systems connecting to this server, and the number of
> > > > nfs client states (from the states files in the clients directory) is
> > > > around 200,000.
> > >
> > > Are any of these clients connecting to the server from the same host?
> > > The only reason I ask is that I fixed a recursion deadlock that manifested
> > > in testing when memory was very low and LOCALIO was used to loopback-mount
> > > on the same host. See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all; otherwise your
> > > report would include breadcrumbs like the ones I mentioned in my commit,
> > > e.g. "task kcompactd0:58 blocked for more than 4435 seconds".)
> >
> > Normally the server does not NFS mount itself. We also don't have any
> > "blocked task" messages reported in dmesg.
> >
> > >
> > > > When I look at our monitoring logs, the system has frequent direct
> > > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > > is about 50. So it seems the system is out of memory and is constantly
> > > > trying to free pages? If I understand it correctly, the system hits a
> > > > threshold which makes it scan for pages to free, frees some pages, and
> > > > then very quickly hits the low watermark again once it stops?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G. Why is the system not dropping more of
> > > > those caches instead of constantly scanning for memory to free? Is
> > > > there a tunable we can set so the system will prefer to drop caches
> > > > and allow more memory to be used for other nfsd-related things? Any
> > > > tips on how to debug where the memory pressure is coming from, or why
> > > > the system decides to keep the pages used for cache instead of freeing
> > > > some of them?
Could this be related to the following fix?

https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

    mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
    commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x stable kernel, but we're currently still
running 6.12.13.
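
Next time the stalls start I'll also sample the reclaim counters
directly, so we can see whether it's kswapd or direct reclaim doing the
scanning. A rough sketch of what I have in mind, assuming the usual
/proc/vmstat counter names (they can vary a bit between kernel
versions):

  watch -n 5 "grep -E 'allocstall|pgscan|pgsteal|kswapd' /proc/vmstat"
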
Regards,
Rik
> >
> > The issue is currently not happening, but I've looked at some of our
> > sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s pgscand/s pgsteal/s   %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49      0.00  10623.60     0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82      0.00   7756.98     0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88      0.00   8123.21     0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05     11.73   5339.54     0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50      0.00   7892.46     0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57      0.00   7541.08     0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476 138477148   3949924      1.50  48038428 120797592     13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180 138589324   3974652      1.51  47664880 121277920    157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384 138076684   3969232      1.51  47325688 121184212      4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956 138145972   3883420      1.48  49014976 120205032      5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972 137103196   3852380      1.46  48939020 119461684    306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116 136567640   3891224      1.48  49335740 118841092      3412
> >
> > # sar -W
> > 04:00:00 PM pswpin/s pswpout/s
> > 04:00:00 PM 0.09 0.29
> > 04:10:03 PM 0.33 0.60
> > 04:20:05 PM 0.20 0.38
> > 04:30:03 PM 0.69 0.33
> > 04:40:05 PM 0.36 0.72
> > 04:50:02 PM 0.30 0.46
> >
> > If I read this correctly, the system is scanning for free pages
> > (pgscank) and freeing some of them (pgfree), but the efficiency is low
> > (%vmeff).
> > At the same time, the amount of memory used (kbmemused / %memused) is
> > quite low, as most of the memory is used as cache. There's approx 120G
> > of inactive memory.
> > So I'm at a loss as to why the system is performing these page scans and
> > stalling instead of dropping some of the cache and using that memory instead.
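
One more data point I can try to collect the next time this happens, in
case the problem is really compaction/fragmentation rather than a lack
of reclaimable cache: the per-order free page and compaction picture.
A rough sketch (the exact vmstat counter names may differ a bit between
kernel versions):

  cat /proc/buddyinfo
  grep -E '^(compact_|pgscan_|pgsteal_|allocstall_)' /proc/vmstat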
> >
> > >
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > > some too) to avoid nfsd writeback stalls due to memory being
> > > fragmented and reclaim having to work too hard (in concert with
> > > kcompactd) to find adequate pages.
> > >
> > > > I've run a perf record for 10s and the top 4 events seem to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then use the
> > > perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> > > you will likely need to install some missing deps, e.g.:
> > > yum install perl-open.noarch
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > > - this will generate a perf.data output file.
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web browser
> > > works well).
> > >
> > > I just find a flamegraph way more useful than the ranked ordering of
> > > 'perf report'.
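
Next time I'll also try to grab a capture limited to just the reclaim
and nfsd threads, so the idle time doesn't dominate the graph. Roughly
something like the following (an untested sketch; the pgrep pattern may
need tweaking for the exact thread names on this box):

  perf record -F 99 -g -p "$(pgrep -d, 'kswapd|nfsd')" -- sleep 120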
> >
> > That's a very good idea, thanks. I will try that when the issue returns.
>
> The kswapd process started to consume some CPU again, so I've followed
> this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> >
> > >
> > > > Are there any known memory management changes related to NFS that have
> > > > been introduced that could explain this behavior? What steps can I
> > > > take to debug the root cause of this? Looking at iftop, there isn't
> > > > much going on in terms of throughput. The top 3 NFSv4 server operations
> > > > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too (now cc'd).
> >
> > Thanks. Any ideas on how I can determine the root cause of this are appreciated.
> >
> >
> > Regards,
> > Rik
>
>
>
> --
>
> Rik
--
Rik