* Re: Memory reclaim and high nfsd usage
       [not found] <CAPwv0JktC7Kb4cibSbioNAAZ9FeWs6aHeLRXDk_6MKUik1j3mg@mail.gmail.com>
@ 2025-04-01 17:40 ` Mike Snitzer
  2025-04-01 19:31   ` Rik Theys
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Snitzer @ 2025-04-01 17:40 UTC (permalink / raw)
  To: Rik Theys; +Cc: linux-nfs, linux-mm

On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> Hi,
> 
> Our fileserver is currently running 6.12.13 with the following 3
> patches (from nfsd-testing) applied to it:
> 
> - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> - skip-sending-CB_RECALL_ANY
> - fix-cb_getattr_status-fix
> 
> Frequently the load on the system goes up and top shows a lot of
> kswapd and kcompactd threads next to nfsd threads. During these periods
> (which can last for hours), users complain about very slow NFS access.
> We have approx 260 systems connecting to this server, and the number of
> NFS client states (from the states files in the clients directory) is
> around 200,000.

Are any of these clients connecting to the server from the same host?
Only reason I ask is I fixed a recursion deadlock that manifested in
testing when memory was very low and LOCALIO used to loopback mount on
the same host.  See:

ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
https://git.kernel.org/linus/ce6d9c1c2b5cc785

(I suspect you aren't using NFS loopback mounts at all otherwise your
report would indicate breadcrumbs like I mentioned in my commit,
e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
 
> When I look at our monitoring logs, the system has frequent direct
> reclaim stalls (allocstall_movable, and some allocstall_normal) and
> pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> is about 50. So it seems the system is out of memory and is constantly
> trying to free pages? If I understand it correctly the system hits a
> threshold which makes it scan for pages to free, frees some pages and
> when it stops it very quickly hits the low watermark again?
> 
> But the system has over 150G of memory dedicated to cache, and
> slab_reclaim is only about 16G. Why is the system not dropping more
> caches to free memory instead of constantly looking to free memory? Is
> there a tunable that we can set so the system will prefer to drop
> caches and increase memory usage for other nfsd related things? Any
> tips on how to debug where the memory pressure is coming from, or why
> the system decides to keep the pages used for cache instead of freeing
> some of those?

All good questions, to which I don't have immediate answers (but
others may).
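
One generic knob that at least touches this area (no promise it's your
fix) is vm.watermark_scale_factor, which sets how far apart the
min/low/high watermarks sit, and with that how early kswapd wakes and
how much it reclaims before going back to sleep:

sysctl vm.watermark_scale_factor
# default is 10, i.e. 0.1% of zone memory between watermarks; raising
# it (say to 100) trades some cache for more reclaim headroom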

Just FYI: there is a slow-start development TODO to leverage 6.14's
"DONTCACHE" support (particularly in nfsd, but client might benefit
some too) to avoid nfsd writeback stalls due to memory being
fragmented and reclaim having to work too hard (in concert with
kcompactd) to find adequate pages.
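
In the meantime, to see whether fragmentation really is what reclaim
is fighting here, watching the buddy allocator and the compaction
counters might help (standard procfs; sample /proc/vmstat twice and
diff):

cat /proc/buddyinfo   # free pages per order -- empty high orders mean
                      # kcompactd has real work to do
grep -E 'compact_|allocstall|pgscan_|pgsteal_' /proc/vmstat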

> I've run perf record for 10s and the top 4 events seem to be:
> 
> 1. 54% is swapper in intel_idle_ibrs
> 2. 12% is swapper in intel_idle
> 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> 4. 5% is kswapd0 in __list_del_entry_valid_or_report

10s is pretty short... might consider a longer sample and then use the
perf.data to generate a flamegraph, e.g.:

- Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
  you will likely need to install some missing deps, e.g.:
  yum install perl-open.noarch
- export FLAME=/root/git/FlameGraph
- perf record -F 99 -a -g sleep 120
  - this will generate a perf.data output file.

Once you have perf.data output, generate a flamegraph file (named
perf.svg) using these 2 commands:
perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
$FLAME/flamegraph.pl out.perf-folded > perf.svg

Open the perf.svg image with your favorite image viewer (a web browser
works well).

I just find flamegraph way more useful than 'perf report' ranked
ordering.
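
If kswapd itself turns out to be the hot spot, a narrower capture of
just that thread keeps the graph readable (same recipe, scoped to its
pid):

perf record -F 99 -g -p $(pgrep kswapd0) -- sleep 120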
 
> Are there any known memory management changes related to NFS that have
> been introduced that could explain this behavior? What steps can I
> take to debug the root cause of this? Looking at iftop there isn't
> much going on regarding throughput. The top 3 NFSv4 server operations
> are sequence (9563/s), putfh (9032/s) and getattr (7150/s).

You'd likely do well to expand the audience to include MM too (now cc'd).



* Re: Memory reclaim and high nfsd usage
  2025-04-01 17:40 ` Memory reclaim and high nfsd usage Mike Snitzer
@ 2025-04-01 19:31   ` Rik Theys
  2025-04-01 20:07     ` Rik Theys
  0 siblings, 1 reply; 4+ messages in thread
From: Rik Theys @ 2025-04-01 19:31 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-nfs, linux-mm

Hi,

On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > Hi,
> >
> > Our fileserver is currently running 6.12.13 with the following 3
> > patches (from nfsd-testing) applied to it:
> >
> > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > - skip-sending-CB_RECALL_ANY
> > - fix-cb_getattr_status-fix
> >
> > Frequently the load on the system goes up and top shows a lot of
> > kswapd and kcompactd threads next to nfsd threads. During these periods
> > (which can last for hours), users complain about very slow NFS access.
> > We have approx 260 systems connecting to this server, and the number of
> > NFS client states (from the states files in the clients directory) is
> > around 200,000.
>
> Are any of these clients connecting to the server from the same host?
> Only reason I ask is I fixed a recursion deadlock that manifested in
> testing when memory was very low and LOCALIO used to loopback mount on
> the same host.  See:
>
> ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> https://git.kernel.org/linus/ce6d9c1c2b5cc785
>
> (I suspect you aren't using NFS loopback mounts at all otherwise your
> report would indicate breadcrumbs like I mentioned in my commit,
> e.g. "task kcompactd0:58 blocked for more than 4435 seconds").

Normally the server does not NFS mount itself. We also don't have any
"blocked task" messages reported in dmesg.

>
> > When I look at our monitoring logs, the system has frequent direct
> > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > is about 50. So it seems the system is out of memory and is constantly
> > trying to free pages? If I understand it correctly the system hits a
> > threshold which makes it scan for pages to free, frees some pages and
> > when it stops it very quickly hits the low watermark again?
> >
> > But the system has over 150G of memory dedicated to cache, and
> > slab_reclaim is only about 16G. Why is the system not dropping more
> > caches to free memory instead of constantly looking to free memory? Is
> > there a tunable that we can set so the system will prefer to drop
> > caches and increase memory usage for other nfsd related things? Any
> > tips on how to debug where the memory pressure is coming from, or why
> > the system decides to keep the pages used for cache instead of freeing
> > some of those?

The issue is currently not happening, but I've looked at some of our
sar statistics from today:

# sar -B
04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s pgscand/s pgsteal/s  %vmeff
04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49      0.00  10623.60    0.10
04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82      0.00   7756.98    0.09
04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88      0.00   8123.21    0.08
04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05     11.73   5339.54    0.06
04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50      0.00   7892.46    0.08
04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57      0.00   7541.08    0.11

# sar -r
04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit  %commit  kbactive   kbinact  kbdirty
04:00:00 PM   3942896 180501232   2652336      1.35  29594476 138477148   3949924     1.50  48038428 120797592    13324
04:10:03 PM   4062416 180601484   2564852      1.31  29574180 138589324   3974652     1.51  47664880 121277920   157472
04:20:05 PM   4131172 180150888   3013128      1.54  29669384 138076684   3969232     1.51  47325688 121184212     4448
04:30:03 PM   4112388 180835756   2344936      1.20  30338956 138145972   3883420     1.48  49014976 120205032     5072
04:40:05 PM   3892332 179390408   3428992      1.75  30559972 137103196   3852380     1.46  48939020 119461684   306336
04:50:02 PM   4328220 180002072   3197120      1.63  30873116 136567640   3891224     1.48  49335740 118841092     3412

# sar -W
04:00:00 PM  pswpin/s pswpout/s
04:00:00 PM      0.09      0.29
04:10:03 PM      0.33      0.60
04:20:05 PM      0.20      0.38
04:30:03 PM      0.69      0.33
04:40:05 PM      0.36      0.72
04:50:02 PM      0.30      0.46

If I read this correctly, the system is scanning for pages to reclaim
(pgscank) and freeing some of them (pgfree), but the reclaim efficiency
is very low (%vmeff).
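For reference, sar computes %vmeff as pgsteal/(pgscank+pgscand)*100:
taking the 04:00 sample, 10623.60 / 10817339.49 * 100 ~= 0.10, i.e.
roughly one page reclaimed per thousand scanned.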
At the same time, the amount of memory used (kbmemused / %memused) is
quite low, as most of the memory is used as cache. There's approx 120G
of inactive memory.
So I'm at a loss as to why the system is performing these page scans
and stalling instead of dropping some of the cache and using that
memory instead.

>
> All good questions, to which I don't have immediate answers (but
> others may).
>
> Just FYI: there is a slow-start development TODO to leverage 6.14's
> "DONTCACHE" support (particularly in nfsd, but client might benefit
> some too) to avoid nfsd writeback stalls due to memory being
> fragmented and reclaim having to work too hard (in concert with
> kcompactd) to find adequate pages.
>
> > I've run perf record for 10s and the top 4 events seem to be:
> >
> > 1. 54% is swapper in intel_idle_ibrs
> > 2. 12% is swapper in intel_idle
> > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
>
> 10s is pretty short... might consider a longer sample and then use the
> perf.data to generate a flamegraph, e.g.:
>
> - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
>   you will likely need to install some missing deps, e.g.:
>   yum install perl-open.noarch
> - export FLAME=/root/git/FlameGraph
> - perf record -F 99 -a -g sleep 120
>   - this will generate a perf.data output file.
>
> Once you have perf.data output, generate a flamegraph file (named
> perf.svg) using these 2 commands:
> perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> $FLAME/flamegraph.pl out.perf-folded > perf.svg
>
> Open the perf.svg image with your favorite image viewer (a web browser
> works well).
>
> I just find flamegraph way more useful than 'perf report' ranked
> ordering.

That's a very good idea, thanks. I will try that when the issue returns.

>
> > Are there any known memory management changes related to NFS that have
> > been introduced that could explain this behavior? What steps can I
> > take to debug the root cause of this? Looking at iftop there isn't
> > much going on regarding throughput. The top 3 NFSv4 server operations
> > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
>
> You'd likely do well to expand the audience to include MM too (now cc'd).

Thanks. Any ideas on how I can determine the root cause of this are appreciated.


Regards,
Rik



* Re: Memory reclaim and high nfsd usage
  2025-04-01 19:31   ` Rik Theys
@ 2025-04-01 20:07     ` Rik Theys
  2025-04-01 20:43       ` Rik Theys
  0 siblings, 1 reply; 4+ messages in thread
From: Rik Theys @ 2025-04-01 20:07 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-nfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 7733 bytes --]

Hi,

On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@gmail.com> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@kernel.org> wrote:
> >
> > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > Hi,
> > >
> > > Our fileserver is currently running 6.12.13 with the following 3
> > > patches (from nfsd-testing) applied to it:
> > >
> > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > - skip-sending-CB_RECALL_ANY
> > > - fix-cb_getattr_status-fix
> > >
> > > Frequently the load on the system goes up and top shows a lot of
> > > kswapd and kcompactd threads next to nfsd threads. During these periods
> > > (which can last for hours), users complain about very slow NFS access.
> > > We have approx 260 systems connecting to this server, and the number of
> > > NFS client states (from the states files in the clients directory) is
> > > around 200,000.
> >
> > Are any of these clients connecting to the server from the same host?
> > Only reason I ask is I fixed a recursion deadlock that manifested in
> > testing when memory was very low and LOCALIO used to loopback mount on
> > the same host.  See:
> >
> > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> >
> > (I suspect you aren't using NFS loopback mounts at all otherwise your
> > report would indicate breadcrumbs like I mentioned in my commit,
> > e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
>
> Normally the server does not NFS mount itself. We also don't have any
> "blocked task" messages reported in dmesg.
>
> >
> > > When I look at our monitoring logs, the system has frequent direct
> > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > is about 50. So it seems the system is out of memory and is constantly
> > > trying to free pages? If I understand it correctly the system hits a
> > > threshold which makes it scan for pages to free, frees some pages and
> > > when it stops it very quickly hits the low watermark again?
> > >
> > > But the system has over 150G of memory dedicated to cache, and
> > > slab_reclaim is only about 16G. Why is the system not dropping more
> > > caches to free memory instead of constantly looking to free memory? Is
> > > there a tunable that we can set so the system will prefer to drop
> > > caches and increase memory usage for other nfsd related things? Any
> > > tips on how to debug where the memory pressure is coming from, or why
> > > the system decides to keep the pages used for cache instead of freeing
> > > some of those?
>
> The issue is currently not happening, but I've looked at some of our
> sar statistics from today:
>
> # sar -B
> 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s pgscand/s pgsteal/s  %vmeff
> 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49      0.00  10623.60    0.10
> 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82      0.00   7756.98    0.09
> 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88      0.00   8123.21    0.08
> 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05     11.73   5339.54    0.06
> 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50      0.00   7892.46    0.08
> 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57      0.00   7541.08    0.11
>
> # sar -r
> 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit  %commit  kbactive   kbinact  kbdirty
> 04:00:00 PM   3942896 180501232   2652336      1.35  29594476 138477148   3949924     1.50  48038428 120797592    13324
> 04:10:03 PM   4062416 180601484   2564852      1.31  29574180 138589324   3974652     1.51  47664880 121277920   157472
> 04:20:05 PM   4131172 180150888   3013128      1.54  29669384 138076684   3969232     1.51  47325688 121184212     4448
> 04:30:03 PM   4112388 180835756   2344936      1.20  30338956 138145972   3883420     1.48  49014976 120205032     5072
> 04:40:05 PM   3892332 179390408   3428992      1.75  30559972 137103196   3852380     1.46  48939020 119461684   306336
> 04:50:02 PM   4328220 180002072   3197120      1.63  30873116 136567640   3891224     1.48  49335740 118841092     3412
>
> # sar -W
> 04:00:00 PM  pswpin/s pswpout/s
> 04:00:00 PM      0.09      0.29
> 04:10:03 PM      0.33      0.60
> 04:20:05 PM      0.20      0.38
> 04:30:03 PM      0.69      0.33
> 04:40:05 PM      0.36      0.72
> 04:50:02 PM      0.30      0.46
>
> If I read this correctly, the system is scanning for pages to reclaim
> (pgscank) and freeing some of them (pgfree), but the reclaim efficiency
> is very low (%vmeff).
> At the same time, the amount of memory used (kbmemused / %memused) is
> quite low, as most of the memory is used as cache. There's approx 120G
> of inactive memory.
> So I'm at a loss as to why the system is performing these page scans
> and stalling instead of dropping some of the cache and using that
> memory instead.
>
> >
> > All good questions, to which I don't have immediate answers (but
> > others may).
> >
> > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > some too) to avoid nfsd writeback stalls due to memory being
> > fragmented and reclaim having to work too hard (in concert with
> > kcompactd) to find adequate pages.
> >
> > > I've run perf record for 10s and the top 4 events seem to be:
> > >
> > > 1. 54% is swapper in intel_idle_ibrs
> > > 2. 12% is swapper in intel_idle
> > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> >
> > 10s is pretty short... might consider a longer sample and then use the
> > perf.data to generate a flamegraph, e.g.:
> >
> > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> >   you will likely need to install some missing deps, e.g.:
> >   yum install perl-open.noarch
> > - export FLAME=/root/git/FlameGraph
> > - perf record -F 99 -a -g sleep 120
> >   - this will generate a perf.data output file.
> >
> > Once you have perf.data output, generate a flamegraph file (named
> > perf.svg) using these 2 commands:
> > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> >
> > Open the perf.svg image with your favorite image viewer (a web browser
> > works well).
> >
> > I just find flamegraph way more useful than 'perf report' ranked
> > ordering.
>
> That's a very good idea, thanks. I will try that when the issue returns.

The kswapd process started to consume some CPU again, so I've followed
this procedure. See the attached file.

Does this show some sort of locking contention?

Regards,
Rik

>
> >
> > > Are there any known memory management changes related to NFS that have
> > > been introduced that could explain this behavior? What steps can I
> > > take to debug the root cause of this? Looking at iftop there isn't
> > > much going on regarding throughput. The top 3 NFSv4 server operations
> > > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
> >
> > You'd likely do well to expand the audience to include MM too (now cc'd).
>
> Thanks. Any ideas on how I can determine the root cause of this are appreciated.
>
>
> Regards,
> Rik



-- 

Rik

[-- Attachment #2: perf2.svg --]
[-- Type: image/svg+xml, Size: 384238 bytes --]


* Re: Memory reclaim and high nfsd usage
  2025-04-01 20:07     ` Rik Theys
@ 2025-04-01 20:43       ` Rik Theys
  0 siblings, 0 replies; 4+ messages in thread
From: Rik Theys @ 2025-04-01 20:43 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-nfs, linux-mm

Hi,

On Tue, Apr 1, 2025 at 10:07 PM Rik Theys <rik.theys@gmail.com> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@gmail.com> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@kernel.org> wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these periods
> > > > (which can last for hours), users complain about very slow NFS access.
> > > > We have approx 260 systems connecting to this server, and the number of
> > > > NFS client states (from the states files in the clients directory) is
> > > > around 200,000.
> > >
> > > Are any of these clients connecting to the server from the same host?
> > > Only reason I ask is I fixed a recursion deadlock that manifested in
> > > testing when memory was very low and LOCALIO used to loopback mount on
> > > the same host.  See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all otherwise your
> > > report would indicate breadcrumbs like I mentioned in my commit,
> > > e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
> >
> > Normally the server does not NFS mount itself. We also don't have any
> > "blocked task" messages reported in dmesg.
> >
> > >
> > > > When I look at our monitoring logs, the system has frequent direct
> > > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > > is about 50. So it seems the system is out of memory and is constantly
> > > > trying to free pages? If I understand it correctly the system hits a
> > > > threshold which makes it scan for pages to free, frees some pages and
> > > > when it stops it very quickly hits the low watermark again?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G. Why is the system not dropping more
> > > > caches to free memory instead of constantly looking to free memory? Is
> > > > there a tunable that we can set so the system will prefer to drop
> > > > caches and increase memory usage for other nfsd related things? Any
> > > > tips on how to debug where the memory pressure is coming from, or why
> > > > the system decides to keep the pages used for cache instead of freeing
> > > > some of those?

Could this be related to this stable backport?
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x kernel, but we're still running
6.12.13 currently.
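
For what it's worth, which stable tag first picked up that backport is
a one-liner against a stable tree clone (using the commit id from the
URL above):

git tag --contains e21ce310556ec40b5b2987e02d12ca7109a33a61 | sort -V | head -1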

Regards,
Rik

> >
> > The issue is currently not happening, but I've looked at some of our
> > sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s   pgscank/s pgscand/s pgsteal/s  %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24 10817339.49      0.00  10623.60    0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28  8464619.82      0.00   7756.98    0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36  9689653.88      0.00   8123.21    0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27  9268350.05     11.73   5339.54    0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10  9797882.50      0.00   7892.46    0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92  6885422.57      0.00   7541.08    0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit  %commit  kbactive   kbinact  kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476 138477148   3949924     1.50  48038428 120797592    13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180 138589324   3974652     1.51  47664880 121277920   157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384 138076684   3969232     1.51  47325688 121184212     4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956 138145972   3883420     1.48  49014976 120205032     5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972 137103196   3852380     1.46  48939020 119461684   306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116 136567640   3891224     1.48  49335740 118841092     3412
> >
> > # sar -W
> > 04:00:00 PM  pswpin/s pswpout/s
> > 04:00:00 PM      0.09      0.29
> > 04:10:03 PM      0.33      0.60
> > 04:20:05 PM      0.20      0.38
> > 04:30:03 PM      0.69      0.33
> > 04:40:05 PM      0.36      0.72
> > 04:50:02 PM      0.30      0.46
> >
> > If I read this correctly, the system is scanning for pages to reclaim
> > (pgscank) and freeing some of them (pgfree), but the reclaim efficiency
> > is very low (%vmeff).
> > At the same time, the amount of memory used (kbmemused / %memused) is
> > quite low, as most of the memory is used as cache. There's approx 120G
> > of inactive memory.
> > So I'm at a loss as to why the system is performing these page scans
> > and stalling instead of dropping some of the cache and using that
> > memory instead.
> >
> > >
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > > some too) to avoid nfsd writeback stalls due to memory being
> > > fragmented and reclaim having to work too hard (in concert with
> > > kcompactd) to find adequate pages.
> > >
> > > > I've run perf record for 10s and the top 4 events seem to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then use the
> > > perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> > >   you will likely need to install some missing deps, e.g.:
> > >   yum install perl-open.noarch
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > >   - this will generate a perf.data output file.
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web browser
> > > works well).
> > >
> > > I just find flamegraph way more useful than 'perf report' ranked
> > > ordering.
> >
> > That's a very good idea, thanks. I will try that when the issue returns.
>
> The kswapd process started to consume some CPU again, so I've followed
> this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> >
> > >
> > > > Are there any known memory management changes related to NFS that have
> > > > been introduced that could explain this behavior? What steps can I
> > > > take to debug the root cause of this? Looking at iftop there isn't
> > > > much going on regarding throughput. The top 3 NFSv4 server operations
> > > > are sequence (9563/s), putfh (9032/s) and getattr (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too (now cc'd).
> >
> > Thanks. Any ideas on how I can determine the root cause of this are appreciated.
> >
> >
> > Regards,
> > Rik
>
>
>
> --
>
> Rik



-- 

Rik


