From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
To: Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	 Igor Raits <igor.raits@gooddata.com>,
	Daniel Secik <daniel.secik@gooddata.com>,
	 Charan Teja Kalla <quic_charante@quicinc.com>
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
Date: Thu, 9 Nov 2023 07:39:15 +0100	[thread overview]
Message-ID: <CAK8fFZ5Uez5VWDnR4Nk1FUO5Q47rr2g4=2heixkLoxCj7Cp22Q@mail.gmail.com> (raw)
In-Reply-To: <CAOUHufaxNQchy9gyPLVUq67uOcF8BkV5J93ZK5Vr+SosdXZw_g@mail.gmail.com>

>
> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao
> >
> > thanks for response, see answers inline:
> >
> > >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > system (16numa domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it earlier; it has been under investigation since
> > May 23rd (6.4.y and maybe even 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.
>

I will give it a try using 6.6.y. If it works, we can switch to
6.6.y instead of backporting the fixes to 6.5.y.
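
For anyone reproducing the /A/ vs /B/ comparison quoted below, MGLRU can
be toggled at runtime through its sysfs knob (see
Documentation/admin-guide/mm/multigen_lru.rst); a minimal sketch:

  # check whether MGLRU is currently enabled (prints a feature bitmask; 0x0000 = disabled)
  cat /sys/kernel/mm/lru_gen/enabled
  # disable / re-enable it at runtime
  echo n > /sys/kernel/mm/lru_gen/enabled
  echo y > /sys/kernel/mm/lru_gen/enabled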

> > > > Symptoms of my issue are
> > > >
> > > > /A/ if multi-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > >
> > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
> > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > >     ...
> > > >         765 root      20   0       0      0      0 R  98.3   0.0 34969:04 kswapd3
> > > >     ...
> > > > 2/ swap space usage is low, about ~4MB out of the 8GB of swap on zram
> > > > (it was also observed with a swap disk and caused IO latency issues
> > > > due to some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > >
> > > >
> > > > /B/ if multi-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
> > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > >     ...
> > > >        765 root      20   0       0      0      0 S   3.6   0.0 34966:46 [kswapd3]
> > > >     ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong as they are swapping in/out extensively;
> > > > however, the multi-gen LRU situation is 10 times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. It is true
> > that this node has the lowest free memory of all nodes (~302MB free);
> > however, the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/applications are running in pre-allocated HugePages and the
> > rest is used for a small set of system services and device drivers. It
> > is static and not growing. The issue persists when I stop the system
> > services and free the memory.
>
> Yes, this helps.
>  Also could you attach /proc/buddyinfo from the moment
> you hit the problem?
>

I can. The problem is continuous: it is doing swap in/out 100% of the
time, consuming 100% of CPU and blocking IO.
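
For reference, the in/out rate itself can be confirmed from the
cumulative swap counters in /proc/vmstat (or with sar -W 1); a rough
sketch, sampling over ten seconds:

  # pswpin/pswpout are cumulative counts of pages swapped in and out
  grep -E '^pswp(in|out)' /proc/vmstat; sleep 10; grep -E '^pswp(in|out)' /proc/vmstat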

The output of /proc/buddyinfo is:

# cat /proc/buddyinfo
Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
Node 10, zone  Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
Node 11, zone  Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
Node 12, zone  Normal    142    378   1317    466   1512   1568    646    359    248    264    228
Node 13, zone  Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
Node 14, zone  Normal    376    221    120    360   2721   2378   1521    826    442    204     59
Node 15, zone  Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
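
Since the reclaim is presumably driven by the zone watermarks you
mentioned ("refuel before the tank becomes empty"), the node 3
watermarks can also be pulled from /proc/zoneinfo; a rough, untested
one-liner (node 3 only has the Normal zone, as seen above):

  # free pages vs. the min/low/high watermarks for node 3
  sed -n '/^Node 3,/,/^Node 4,/p' /proc/zoneinfo | grep -E 'pages free|min |low |high '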


> > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > pattern?
> > >
> > > The easiest way is to disable the NUMA domains so that there would be
> > > only two nodes with 8x more memory. IOW, you have fewer pools, but each
> > > pool has more memory and is therefore less likely to become empty.
> > >
> > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > >     NUMA stats:
> > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
> > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417 2623 2833 2530 2269
> > > > The swap in/out usage does not make sense to me, nor does the CPU
> > > > utilization by multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. Neither the memory usage nor the swap space usage is
> > growing; the swap stays at a few MB.
> >
> > > 2. Was THP enabled?
> >
> > Both situations with enabled and with disabled THP.
>
> My suspicion is that you packed node 3 too perfectly :) And that
> might have triggered a known but currently low-priority problem in
> MGLRU. I'm attaching a patch for v6.6, hoping you could verify it
> for me in case v6.6 by itself still has the problem.
>

I would not focus just on node3; we had issues on different servers
with node0 and node2 in parallel, but mostly it is node3.

This is how our setup looks:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system.

There are KVM VMs running with their vCPUs pinned to the NUMA domains
and using the HugePages (the topology is exposed to the VMs,
no overcommit, no shared CPUs), and the qemu-kvm threads are pinned to
the same NUMA domain as the vCPUs. System services are not pinned; I'm
not sure why node3 is used the most, as the VMs are balanced and the
host's system services can move between domains.
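
In case it helps to correlate with the buddyinfo above, the per-node
HugePages reservation and free memory can be checked from sysfs and
numactl (node3 here is just an example; the 1048576kB directory name
assumes 1GB pages):

  # reserved vs. free 1GB HugePages on node 3
  cat /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages
  cat /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/free_hugepages
  # free memory per NUMA node as the allocator sees it
  numactl --hardware | grep free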

> > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > > produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling NUMA is not an option. However, we are now testing a setup
> > with 1GB less in HugePages on each NUMA node.
> >
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, the 6.6.1 was released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I tried disabling THP without any effect.
>
> Gotcha. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)

I can try it. Will let you know.



Thread overview: 30+ messages
2023-11-08 14:35 Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04   ` Jaroslav Pulchart
2023-11-08 22:09     ` Yu Zhao
2023-11-09  6:39       ` Jaroslav Pulchart [this message]
2023-11-09  6:48         ` Yu Zhao
2023-11-09 10:58           ` Jaroslav Pulchart
2023-11-10  1:31             ` Yu Zhao
     [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
2023-11-13 20:09                 ` Yu Zhao
2023-11-14  7:29                   ` Jaroslav Pulchart
2023-11-14  7:47                     ` Yu Zhao
2023-11-20  8:41                       ` Jaroslav Pulchart
2023-11-22  6:13                         ` Yu Zhao
2023-11-22  7:12                           ` Jaroslav Pulchart
2023-11-22  7:30                             ` Jaroslav Pulchart
2023-11-22 14:18                               ` Yu Zhao
2023-11-29 13:54                                 ` Jaroslav Pulchart
2023-12-01 23:52                                   ` Yu Zhao
2023-12-07  8:46                                     ` Charan Teja Kalla
2023-12-07 18:23                                       ` Yu Zhao
2023-12-08  8:03                                       ` Jaroslav Pulchart
2024-01-03 21:30                                         ` Jaroslav Pulchart
2024-01-04  3:03                                           ` Yu Zhao
2024-01-04  9:46                                             ` Jaroslav Pulchart
2024-01-04 14:34                                               ` Jaroslav Pulchart
2024-01-04 23:51                                                 ` Igor Raits
2024-01-05 17:35                                                   ` Ertman, David M
2024-01-08 17:53                                                     ` Jaroslav Pulchart
2024-01-16  4:58                                                       ` Yu Zhao
2024-01-16 17:34                                                         ` Jaroslav Pulchart
