From: Yu Zhao <yuzhao@google.com>
To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Igor Raits <igor.raits@gooddata.com>,
Daniel Secik <daniel.secik@gooddata.com>,
Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
Date: Wed, 8 Nov 2023 22:48:29 -0800 [thread overview]
Message-ID: <CAOUHufb8RCBXCF_f33kO2HiEKK03nXu=W+PikfYRdnRM3kWo9w@mail.gmail.com> (raw)
In-Reply-To: <CAK8fFZ5Uez5VWDnR4Nk1FUO5Q47rr2g4=2heixkLoxCj7Cp22Q@mail.gmail.com>
On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for the response, see answers inline:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report an unpleasant behavior of multi-gen LRU with
> > > > > strange swap in/out usage on my Dell 7525 two-socket AMD 74F3 system
> > > > > (16 NUMA domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it earlier as well; it has been under investigation
> > > since 23rd May (on 6.4.y and maybe even 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU; I can backport them to v6.5
> > for you if you run into other problems with v6.6.
> >
>
> I will give 6.6.y a try. If it works, we can switch to 6.6.y instead of
> backporting the fixes to 6.5.y.
>
> > > > > The symptoms of my issue are:
> > > > >
> > > > > /A/ if multi-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > >
> > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 R 98.3 0.0 34969:04 kswapd3
> > > > > ...
> > > > > 2/ swap space usage is low, about ~4MB out of 8GB, with swap on zram
> > > > > (this was observed with a swap disk as well and caused IO latency
> > > > > issues due to some kind of locking)
> > > > > 3/ swap in/out is huge and symmetrical, ~12MB/s in and ~12MB/s out
> > > > >
> > > > >
> > > > > /B/ if multi-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 S 3.6 0.0 34966:46 [kswapd3]
> > > > > ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap in/out is huge and symmetrical, ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong, as they are using swap in/out extensively;
> > > > > however, the multi-gen LRU situation is 10 times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. It is true
> > > that this node has the lowest free memory of all nodes (~302MB free);
> > > however, the swap space usage is just 4MB (and it still keeps going in
> > > and out). So what can be the reason for that behaviour?
> >
> > The best analogy is that refueling (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high-order allocations).
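
To put numbers on the analogy: the thresholds here are the per-zone
min/low/high watermarks. kswapd is woken when free pages fall below the
low watermark and keeps reclaiming until the high watermark is reached,
and the watermarks can be temporarily boosted when high-order
allocations run into fragmentation. If you want to see the current
values (in pages), something like this works:

  awk '/^Node/ || $1 ~ /^(min|low|high)$/' /proc/zoneinfo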
> >
> > > The workers/application are running in pre-allocated HugePages, and the
> > > rest is used for a small set of system services and device drivers. It
> > > is static and not growing. The issue persists even when I stop the
> > > system services and free their memory.
> >
> > Yes, this helps.
> > Also, could you attach /proc/buddyinfo from the moment you hit the
> > problem?
> >
>
> I can. The problem is continuous: it is doing swap in/out 100% of the
> time, consuming 100% of a CPU and blocking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Again, thinking out loud: there is only one zone on node 3, i.e., the
normal zone, and this rules out the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").
> Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
> Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
> Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
> Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
> Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
> Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
> Node 10, zone  Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
> Node 11, zone  Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
> Node 12, zone  Normal    142    378   1317    466   1512   1568    646    359    248    264    228
> Node 13, zone  Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
> Node 14, zone  Normal    376    221    120    360   2721   2378   1521    826    442    204     59
> Node 15, zone  Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
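
For reference, the columns in /proc/buddyinfo are counts of free blocks
of order 0 through 10, i.e. 4KB up to 4MB chunks with 4KB base pages. A
rough per-zone free-memory figure can be recomputed from it with
something like:

  awk '{s=0; for (i=5; i<=NF; i++) s += $i * 4 * 2^(i-5); printf "%s %s %s %d MB free\n", $1, $2, $4, s/1024}' /proc/buddyinfo

(the 4 being the base page size in KB).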
>
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > pattern?
> > > >
> > > > The easiest way is to disable the NUMA domains so that there would be
> > > > only two nodes, each with 8x more memory. IOW, you would have fewer
> > > > pools, but each pool has more memory and is therefore less likely to
> > > > become empty.
> > > >
> > > > > There is free RAM in each NUMA node while only a few MB are used in swap:
> > > > > NUMA stats:
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
> > > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417 2623 2833 2530 2269
> > > > > The swap in/out usage does not make sense to me, nor does the CPU
> > > > > utilization by multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills in either case?
> > >
> > > There are no OOM kills. Neither the memory usage nor the swap space
> > > usage is growing; the latter stays at a few MB.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations occur, with THP enabled and with THP disabled.
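
(For the record, the effective THP mode can be read back with

  cat /sys/kernel/mm/transparent_hugepage/enabled

where the active setting is the one shown in brackets.)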
> >
> > My suspicion is that you packed node 3 too perfectly :) And that might
> > have triggered a known but currently low-priority problem in MGLRU. I'm
> > attaching a patch for v6.6 and hoping you could verify it for me, in
> > case v6.6 by itself still has the problem.
> >
>
> I would not focus just on node3; we had issues on different servers with
> node0 and node2 in parallel, but mostly it is node3.
>
> Here is how our setup looks:
> * each node has 64GB of RAM,
> * 61GB of it is in 1GB HugePages,
> * the remaining 3GB is used by the host system
>
> There are KVM VMs running with vCPUs pinned to the NUMA domains and
> using the HugePages (the topology is exposed to the VMs, no overcommit,
> no shared CPUs); the qemu-kvm threads are pinned to the same NUMA domain
> as the vCPUs. System services are not pinned. I'm not sure why node3 is
> used the most, as the VMs are balanced and the host's system services
> can move between domains.
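
In case it helps to narrow that down: per-node placement on the host
side can be inspected with the numastat tools (assuming the numactl
package is installed), e.g.

  numastat -m        # per-node MemFree/MemUsed breakdown
  numastat -p <pid>  # per-node memory of a given process

where <pid> is just a placeholder for whichever qemu-kvm process or
system service you want to check.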
>
> > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > > > produce more THPs.
> > > >
> > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > >
> > > Disabling NUMA is not an option. However, we are now testing a setup
> > > with 1GB less in HugePages per NUMA node.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet; 6.6.1 was only released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I tried disabling THP, without any effect.
> >
> > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CCing Charan @ Qualcomm, who initially reported the problem that
> > led to the attached patch.)
>
> I can try it. Will let you know.
Great, thanks!
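
One more note in case it saves you a reboot while testing: MGLRU can be
toggled at runtime, so you can flip between the two configurations above
on the same kernel:

  cat /sys/kernel/mm/lru_gen/enabled     # shows the enabled components as a bitmask
  echo n >/sys/kernel/mm/lru_gen/enabled # disable MGLRU
  echo y >/sys/kernel/mm/lru_gen/enabled # enable all MGLRU components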