Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU

linux-mm.kvack.org archive mirror

From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
To: Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	 Igor Raits <igor.raits@gooddata.com>,
	Daniel Secik <daniel.secik@gooddata.com>,
	 Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
Date: Thu, 9 Nov 2023 11:58:14 +0100
Message-ID: <CAK8fFZ4ZMmvp__J9vsDB8TsX4908G4vDTh2nTkDwJ107LC3Odg@mail.gmail.com>
In-Reply-To: <CAOUHufb8RCBXCF_f33kO2HiEKK03nXu=W+PikfYRdnRM3kWo9w@mail.gmail.com>

>
> On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > Hi Jaroslav,
> > > >
> > > > Hi Yu Zhao
> > > >
> > > > thanks for response, see answers inline:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > system (16 NUMA domains).
> > > > >
> > > > > Kernel version please?
> > > >
> > > > 6.5.y, but we saw it earlier; it has been under investigation since
> > > > 23rd May (6.4.y and maybe even 6.3.y).
> > >
> > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > for you if you run into other problems with v6.6.
> > >
> >
> > I will give it a try using 6.6.y. If it works, we can switch to
> > 6.6.y instead of backporting the fixes to 6.5.y.
> >
> > > > > > Symptoms of my issue are
> > > > > >
> > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > >
> > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > >
> > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
> > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > >     ...
> > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0 34969:04 kswapd3
> > > > > > >     ...
> > > > > >     ...
> > > > > > > 2/ swap space usage is low, about 4MB of 8GB, with swap on zram (also
> > > > > > > observed with a swap disk, where it caused IO latency issues due to
> > > > > > > some kind of locking)
> > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > >
> > > > > >
> > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
> > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > >     ...
> > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0 34966:46 [kswapd3]
> > > > > > >     ...
> > > > > >     ...
> > > > > > 2/ swap space usage is low (4MB)
> > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > >
> > > > > > > Both situations are wrong, as they use swap in/out extensively;
> > > > > > > however, the multi-gen LRU situation is 10 times worse.
> > > > >
> > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > both cases, the reclaim activities were as expected.
> > > >
> > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > however the swap space usage is just 4MB (still going in and out). So
> > > > what can be the reason for that behaviour?
> > >
> > > The best analogy is that refuel (reclaim) happens before the tank
> > > becomes empty, and it happens even sooner when there is a long road
> > > ahead (high order allocations).
> > >
> > > > The workers/application is running in pre-allocated HugePages and the
> > > > rest is used for a small set of system services and drivers of
> > > > devices. It is static and not growing. The issue persists when I stop
> > > > the system services and free the memory.
> > >
> > > Yes, this helps. Also, could you attach /proc/buddyinfo from the moment
> > > you hit the problem?
> > >
> >
> > I can. The problem is continuous: it is swapping in/out 100% of the
> > time, consuming 100% CPU and locking IO.
> >
> > The output of /proc/buddyinfo is:
> >
> > # cat /proc/buddyinfo
> > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
>
> Again, thinking out loud: there is only one zone on node 3, i.e., the
> normal zone, and this rules out the problem that commit
> 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> reclaim") fixed in v6.6.

I built vanilla 6.6.1 and ran a first quick test: spinning up and
destroying VMs only. This test does not always trigger the continuous
kswapd3 swap in/out, but it does exercise kswapd, and it looks like
there is a change.

With 6.5.y I can see non-continuous kswapd usage (15s and more of CPU time):
 # ps ax | grep [k]swapd
    753 ?        S      0:00 [kswapd0]
    754 ?        S      0:00 [kswapd1]
    755 ?        S      0:00 [kswapd2]
    756 ?        S      0:15 [kswapd3]    <<<<<<<<<
    757 ?        S      0:00 [kswapd4]
    758 ?        S      0:00 [kswapd5]
    759 ?        S      0:00 [kswapd6]
    760 ?        S      0:00 [kswapd7]
    761 ?        S      0:00 [kswapd8]
    762 ?        S      0:00 [kswapd9]
    763 ?        S      0:00 [kswapd10]
    764 ?        S      0:00 [kswapd11]
    765 ?        S      0:00 [kswapd12]
    766 ?        S      0:00 [kswapd13]
    767 ?        S      0:00 [kswapd14]
    768 ?        S      0:00 [kswapd15]

and no kswapd usage with 6.6.1, which looks like a promising path:

# ps ax | grep [k]swapd
    808 ?        S      0:00 [kswapd0]
    809 ?        S      0:00 [kswapd1]
    810 ?        S      0:00 [kswapd2]
    811 ?        S      0:00 [kswapd3]    <<<< nice
    812 ?        S      0:00 [kswapd4]
    813 ?        S      0:00 [kswapd5]
    814 ?        S      0:00 [kswapd6]
    815 ?        S      0:00 [kswapd7]
    816 ?        S      0:00 [kswapd8]
    817 ?        S      0:00 [kswapd9]
    818 ?        S      0:00 [kswapd10]
    819 ?        S      0:00 [kswapd11]
    820 ?        S      0:00 [kswapd12]
    821 ?        S      0:00 [kswapd13]
    822 ?        S      0:00 [kswapd14]
    823 ?        S      0:00 [kswapd15]

I will install 6.6.1 on a server that is doing some work and
observe it later today.
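
The two `ps ax | grep [k]swapd` listings above can also be compared mechanically. This is a minimal sketch, assuming the TIME column is in the M:SS form shown in those listings; the function name `kswapd_cpu_seconds` is made up for illustration.

```python
import re


def kswapd_cpu_seconds(ps_output):
    """Map each kswapd thread name to its accumulated CPU time in seconds.

    Expects lines like "  756 ?   S   0:15 [kswapd3]" and assumes the
    TIME column is minutes:seconds.
    """
    times = {}
    for line in ps_output.splitlines():
        match = re.search(r"(\d+):(\d\d)\s+\[?(kswapd\d+)\]?", line)
        if match:
            minutes = int(match.group(1))
            seconds = int(match.group(2))
            times[match.group(3)] = minutes * 60 + seconds
    return times


def busiest(times):
    """Return the name of the thread with the most CPU time, or None."""
    return max(times, key=times.get) if times else None
```

Feeding it the 6.5.y listing should single out kswapd3 with 15 seconds, while the 6.6.1 listing gives zero for every thread.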


>
> > Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
> > Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
> > Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
> > Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
> > Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
> > Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
> > Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
> > Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
> > Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
> > Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
> > Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
> > Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
> >
> >
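To make the /proc/buddyinfo dump easier to read, here is a minimal sketch that turns each row into a free-page total. It assumes the usual layout, where column k counts free blocks of 2^k pages, and a 4 KiB page size; the helper names are illustrative.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <count> <count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        zones[(int(parts[1]), parts[3])] = [int(x) for x in parts[4:]]
    return zones


def free_pages(counts):
    """Total free pages: each order-k block holds 2**k pages."""
    return sum(n << order for order, n in enumerate(counts))


def free_mib(counts, page_size=4096):
    """Free memory in MiB, assuming 4 KiB pages (an assumption here)."""
    return free_pages(counts) * page_size / 2**20
```

Applied to the node 3 row, this gives 76928 free pages, or about 300 MiB, which is consistent with the ~302MB of free memory reported for node 3 in the NUMA stats.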
> > > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > > pattern?
> > > > >
> > > > > The easiest way is to disable NUMA domain so that there would be only
> > > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > > has more memory and therefore they are less likely to become empty.
> > > > >
> > > > > > > There is free RAM in each NUMA node, far more than the few MB used in
> > > > > > > swap:
> > > > > > >     NUMA stats:
> > > > > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
> > > > > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417 2623 2833 2530 2269
> > > > > > > The swap in/out usage does not make sense to me, nor does the CPU
> > > > > > > utilization by multi-gen LRU.
> > > > >
> > > > > My questions:
> > > > > 1. Were there any OOM kills with either case?
> > > >
> > > > There is no OOM. Neither the memory usage nor the swap space usage
> > > > is growing; swap usage stays at a few MB.
> > > >
> > > > > 2. Was THP enabled?
> > > >
> > > > It happens in both situations, with THP enabled and with THP disabled.
> > >
> > > My suspicion is that you packed the node 3 too perfectly :) And that
> > > might have triggered a known but currently a low priority problem in
> > > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > > for me in case v6.6 by itself still has the problem?
> > >
> >
> > I would not focus just on node3: we had issues on different servers
> > with node0 and node2, both in parallel, but mostly it is node3.
> >
> > What our setup looks like:
> > * each node has 64GB of RAM,
> > * 61GB of it is in 1GB Huge Pages,
> > * the remaining 3GB is used by the host system
> >
> > The KVM VMs run with their vCPUs pinned to the NUMA domains and use
> > the Huge Pages (topology is exposed to the VMs, no overcommit, no shared
> > CPUs); the qemu-kvm threads are pinned to the same NUMA domain as the
> > vCPUs. System services are not pinned. I'm not sure why node3 is
> > used the most, as the VMs are balanced and the host's system services
> > can move between domains.
> >
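As a rough illustration of how tight this layout is, the sketch below computes how much of a node is left for the host once the huge pages are set aside. It assumes the NUMA stats in this thread are in MiB and that MemFree does not count the pre-allocated huge pages; the function name is hypothetical.

```python
def host_headroom(mem_total_mib, hugepages_gib, mem_free_mib):
    """Return (host-usable MiB, fraction of it that is currently free).

    Assumes 1GB huge pages are carved out of the node up front and
    that mem_free_mib excludes them.
    """
    host_usable = mem_total_mib - hugepages_gib * 1024
    return host_usable, mem_free_mib / host_usable


# Node 3 figures taken from the NUMA stats quoted in this thread.
usable, free_fraction = host_headroom(65486, 61, 302)
print(f"host-usable: {usable} MiB, free: {free_fraction:.1%}")
```

With the node 3 figures this leaves about 3022 MiB for the host, of which roughly 10% is free.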
> > > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > > > > produce more THPs.
> > > > >
> > > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > > >
> > > > Disabling NUMA is not an option. However, we are now testing a setup
> > > > with 1GB less in HugePages per NUMA node.
> > > >
> > > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > > >
> > > > Not yet, the 6.6.1 was released today.
> > > >
> > > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > > >
> > > > I tried disabling THP, without any effect.
> > >
> > > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> > >
> > > (Also CC Charan @ Qualcomm who initially reported the problem that
> > > ended up with the attached patch.)
> >
> > I can try it. Will let you know.
>
> Great, thanks!

Thread overview: 30+ messages
2023-11-08 14:35 Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04   ` Jaroslav Pulchart
2023-11-08 22:09     ` Yu Zhao
2023-11-09  6:39       ` Jaroslav Pulchart
2023-11-09  6:48         ` Yu Zhao
2023-11-09 10:58           ` Jaroslav Pulchart [this message]
2023-11-10  1:31             ` Yu Zhao
     [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
2023-11-13 20:09                 ` Yu Zhao
2023-11-14  7:29                   ` Jaroslav Pulchart
2023-11-14  7:47                     ` Yu Zhao
2023-11-20  8:41                       ` Jaroslav Pulchart
2023-11-22  6:13                         ` Yu Zhao
2023-11-22  7:12                           ` Jaroslav Pulchart
2023-11-22  7:30                             ` Jaroslav Pulchart
2023-11-22 14:18                               ` Yu Zhao
2023-11-29 13:54                                 ` Jaroslav Pulchart
2023-12-01 23:52                                   ` Yu Zhao
2023-12-07  8:46                                     ` Charan Teja Kalla
2023-12-07 18:23                                       ` Yu Zhao
2023-12-08  8:03                                       ` Jaroslav Pulchart
2024-01-03 21:30                                         ` Jaroslav Pulchart
2024-01-04  3:03                                           ` Yu Zhao
2024-01-04  9:46                                             ` Jaroslav Pulchart
2024-01-04 14:34                                               ` Jaroslav Pulchart
2024-01-04 23:51                                                 ` Igor Raits
2024-01-05 17:35                                                   ` Ertman, David M
2024-01-08 17:53                                                     ` Jaroslav Pulchart
2024-01-16  4:58                                                       ` Yu Zhao
2024-01-16 17:34                                                         ` Jaroslav Pulchart
