From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
To: Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Igor Raits <igor.raits@gooddata.com>,
Daniel Secik <daniel.secik@gooddata.com>,
Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
Date: Thu, 9 Nov 2023 07:39:15 +0100 [thread overview]
Message-ID: <CAK8fFZ5Uez5VWDnR4Nk1FUO5Q47rr2g4=2heixkLoxCj7Cp22Q@mail.gmail.com> (raw)
In-Reply-To: <CAOUHufaxNQchy9gyPLVUq67uOcF8BkV5J93ZK5Vr+SosdXZw_g@mail.gmail.com>
>
> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao
> >
> > thanks for response, see answers inline:
> >
> > >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report an unpleasant behavior of multi-gen LRU:
> > > > strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > system (16 NUMA domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it earlier; it has been under investigation since
> > 23rd May (6.4.y and maybe even 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.
>
I will give it a try with 6.6.y. If it works, we can switch to 6.6.y
instead of backporting the fixes to 6.5.y.
> > > > Symptoms of my issue are
> > > >
> > > > /A/ if multi-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > >
> > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > ...
> > > > 765 root 20 0 0 0 0 R 98.3 0.0 34969:04 kswapd3
> > > > ...
> > > > 2/ swap space usage is low, about ~4MB out of 8GB, with swap on
> > > > zram (also observed with a swap disk, where it caused IO latency
> > > > issues due to some kind of locking)
> > > > 3/ swap in/out is huge and symmetrical: ~12MB/s in and ~12MB/s out
> > > >
> > > >
> > > > /B/ if multi-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > ...
> > > > 765 root 20 0 0 0 0 S 3.6 0.0 34966:46 [kswapd3]
> > > > ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap in/out is huge and symmetrical: ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong, as they use swap in/out extensively;
> > > > however, the multi-gen LRU situation is 10 times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free), that is true;
> > however, the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refueling (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high-order allocations).
>
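To make the analogy concrete for myself, here is a rough Python sketch of why kswapd can keep reclaiming while plenty of order-0 pages remain free. This is illustrative only; the real logic is __zone_watermark_ok() in mm/page_alloc.c, which also handles lowmem reserves and migratetypes:

```python
def watermark_ok(free_pages_by_order, watermark, order):
    """Simplified model of the kernel's zone watermark check:
    the zone must stay above the watermark after the allocation,
    and for order > 0 it must still hold a free block of at
    least the requested order."""
    free = sum(n * 2**o for o, n in enumerate(free_pages_by_order))
    if free - 2**order <= watermark:
        return False          # would dip below the watermark
    if order == 0:
        return True           # any single free page will do
    # Higher orders need an actual contiguous block of that size.
    return any(n > 0 for o, n in enumerate(free_pages_by_order) if o >= order)

# A zone "full" of order-0 pages still fails a high-order check:
fragmented = [10000] + [0] * 10   # 10000 free pages, none contiguous
print(watermark_ok(fragmented, watermark=256, order=0))  # True
print(watermark_ok(fragmented, watermark=256, order=9))  # False
```

With a fragmented zone the order-0 check passes while the order-9 check fails, so kswapd keeps "refueling" even though total free memory looks sufficient.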
> > The workers/applications run in pre-allocated HugePages and the
> > rest is used by a small set of system services and device drivers.
> > It is static and not growing. The issue persists even when I stop
> > the system services and free the memory.
>
> Yes, this helps. Also, could you attach /proc/buddyinfo from the
> moment you hit the problem?
>
I can. The problem is continuous: it is doing swap in/out 100% of the
time, consuming 100% of a CPU and blocking IO.
The output of /proc/buddyinfo is:
# cat /proc/buddyinfo
Node 0, zone DMA 7 2 2 1 1 2 1 1 1 2 1
Node 0, zone DMA32 4567 3395 1357 846 439 190 93 61 43 23 4
Node 0, zone Normal 19 190 140 129 136 75 66 41 9 1 5
Node 1, zone Normal 194 1210 2080 1800 715 255 111 56 42 36 55
Node 2, zone Normal 204 768 3766 3394 1742 468 185 194 238 47 74
Node 3, zone Normal 1622 2137 1058 846 388 208 97 44 14 42 10
Node 4, zone Normal 282 705 623 274 184 90 63 41 11 1 28
Node 5, zone Normal 505 620 6180 3706 1724 1083 592 410 417 168 70
Node 6, zone Normal 1120 357 3314 3437 2264 872 606 209 215 123 265
Node 7, zone Normal 365 5499 12035 7486 3845 1743 635 243 309 292 78
Node 8, zone Normal 248 740 2280 1094 1225 2087 846 308 192 65 55
Node 9, zone Normal 356 763 1625 944 740 1920 1174 696 217 235 111
Node 10, zone Normal 727 1479 7002 6114 2487 1084 407 269 157 78 16
Node 11, zone Normal 189 3287 9141 5039 2560 1183 1247 693 506 252 8
Node 12, zone Normal 142 378 1317 466 1512 1568 646 359 248 264 228
Node 13, zone Normal 444 1977 3173 2625 2105 1493 931 600 369 266 230
Node 14, zone Normal 376 221 120 360 2721 2378 1521 826 442 204 59
Node 15, zone Normal 1210 966 922 2046 4128 2904 1518 744 352 102 58
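For reference, each column in /proc/buddyinfo is the number of free blocks of order 0 to 10 (4KiB to 4MiB on x86-64). A small Python sketch that turns the node 3 row above into total free memory:

```python
PAGE_SIZE = 4096  # base page size on x86-64

def free_bytes(counts):
    """Total free memory from per-order free-block counts:
    column N holds free blocks of order N, each 2**N pages."""
    return sum(n * 2**order * PAGE_SIZE for order, n in enumerate(counts))

# Node 3 row from the buddyinfo output above:
node3 = [1622, 2137, 1058, 846, 388, 208, 97, 44, 14, 42, 10]
print(round(free_bytes(node3) / 2**20))  # ~300 MiB
```

This lines up with the ~302MB MemFree reported for node 3 below, and shows how few high-order blocks node 3 holds compared with, e.g., node 7.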
> > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > pattern?
> > >
> > > The easiest way is to disable the NUMA domains so that there would be
> > > only two nodes with 8x more memory each. IOW, you have fewer pools,
> > > but each pool has more memory and is therefore less likely to become
> > > empty.
> > >
> > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > > NUMA stats:
> > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
> > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417 2623 2833 2530 2269
> > > > The swap in/out usage does not make sense to me, nor does the CPU
> > > > utilization by multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. Neither the memory usage nor the swap space usage is
> > growing; it stays at a few MB.
> >
> > > 2. Was THP enabled?
> >
> > Both situations with enabled and with disabled THP.
>
> My suspicion is that you packed node 3 too perfectly :) And that
> might have triggered a known but currently low-priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me, in case v6.6 by itself still has the problem.
>
I would not focus just on node 3; we have had issues on different
servers with node 0 and node 2 in parallel, but mostly it is node 3.
How our setup looks:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system.
There are KVM VMs with vCPUs pinned to the NUMA domains and using the
HugePages (the topology is exposed to the VMs; no overcommit, no shared
CPUs), and the qemu-kvm threads are pinned to the same NUMA domain as
the vCPUs. System services are not pinned. I am not sure why node 3 is
used the most, as the VMs are balanced and the host's system services
can move between domains.
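The per-node budget above, as back-of-the-envelope arithmetic (approximate, since the kernel itself also reserves memory on each node):

```python
nodes = 16
node_ram_gib = 64        # RAM per NUMA node
hugepages_gib = 61       # pre-allocated 1GiB HugePages per node

# Headroom left for the host system (kernel, services, drivers):
host_gib_per_node = node_ram_gib - hugepages_gib
total_hugepages = nodes * hugepages_gib

print(host_gib_per_node)   # 3 GiB of headroom per node
print(total_hugepages)     # 976 x 1GiB hugepages across the system
```

With only 3GiB of unreserved memory per node, any unpinned service landing on one node can push it toward its watermarks quickly.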
> > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills
> > > or produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling NUMA is not an option. However, we are now testing a setup
> > with 1GB less in HugePages per NUMA node.
> >
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet; 6.6.1 was released just today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I tried disabling THP, without any effect.
>
> Gotcha. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)
I can try it. Will let you know.