From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Date: Thu, 9 Nov 2023 07:39:15 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Yu Zhao
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Igor Raits, Daniel Secik, Charan Teja Kalla

> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart wrote:
> >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao,
> >
> > thanks for the response, see answers inline:
> >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > system (16 NUMA domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it sooner, as it has been under investigation since
> > 23rd May (6.4.y and maybe even 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.

I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
instead of backporting the fixes to 6.5.y.

> > > > Symptoms of my issue are
> > > >
> > > > /A/ if multi-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under
> > > memory pressure.
> > >
> > > > top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
> > > > Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > ...
> > > >     765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
> > > > ...
> > > > 2/ swap space usage is low, about ~4MB out of 8GB, with swap on zram
> > > > (it was observed with a swap disk as well, and caused IO latency
> > > > issues due to some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical, ~12MB/s in and ~12MB/s out
> > > >
> > > > /B/ if multi-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
> > > > Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > ...
> > > >     765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
> > > > ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical, ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong, as they are using swap in/out extensively;
> > > > however, the multi-gen LRU situation is 10 times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free), that is true;
> > however, the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/application is running in pre-allocated HugePages and the
> > rest is used for a small set of system services and drivers of
> > devices. It is static and not growing. The issue persists when I stop
> > the system services and free the memory.
>
> Yes, this helps.
> Also could you attach /proc/buddyinfo from the moment
> you hit the problem?

I can. The problem is continuous: kswapd3 is doing swap in/out,
consuming 100% of a CPU and locking up IO, 100% of the time.
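(A note on reading the dump below: each /proc/buddyinfo column is the
count of free blocks of order 0..10, i.e. blocks of 2^order pages. A
rough sketch to turn that into free MiB per zone -- assuming 4 KiB base
pages, which is what these hosts use:)

    awk '/zone/ {
           pages = 0
           # fields 5..15 hold the free-block counts for orders 0..10
           for (i = 5; i <= NF; i++)
               pages += $i * 2 ^ (i - 5)
           printf "%s %s %s %s: %.0f MiB free\n", $1, $2, $3, $4, pages * 4096 / 1048576
         }' /proc/buddyinfo

For node 3's Normal zone below this works out to roughly 300 MiB,
consistent with the ~302MB MemFree quoted further down.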
The output of /proc/buddyinfo is:

# cat /proc/buddyinfo
Node  0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
Node  0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
Node  0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
Node  1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
Node  2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
Node  3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Node  4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
Node  5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
Node  6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
Node  7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
Node  8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
Node  9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58

> > > > Could I ask for any suggestions on how to avoid the kswapd
> > > > utilization pattern?
> > >
> > > The easiest way is to disable the NUMA domains so that there would be
> > > only two nodes with 8x more memory. IOW, you have fewer pools but each
> > > pool has more memory and therefore they are less likely to become empty.
> > >
> > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > > NUMA stats:
> > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > 65486 65486 65486 65486 65486 65486 65424
> > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > 2623 2833 2530 2269
> > > > the in/out usage does not make sense to me, nor does the CPU
> > > > utilization by multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. The memory usage is not growing, nor is the swap space
> > usage; it is still just a few MB.
> >
> > > 2. Was THP enabled?
> >
> > Both situations, with THP enabled and with it disabled.
>
> My suspicion is that you packed node 3 too perfectly :) And that
> might have triggered a known but currently low-priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me in case v6.6 by itself still has the problem?

I would not focus just on node 3; we had issues on different servers
with node 0 and node 2 in parallel, but mostly it is node 3.

Our setup looks like this:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system.

The KVM VMs' vCPUs are pinned to the NUMA domains and use the HugePages
(topology is exposed to the VMs, no overcommit, no shared CPUs), and the
qemu-kvm threads are pinned to the same NUMA domain as their vCPUs.
System services are not pinned. I'm not sure why node 3 is used the
most, as the VMs are balanced and the host's system services can move
between domains.

> > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills
> > > or produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling NUMA is not an option. However, we are now testing a setup
> > with 1GB less in HugePages per NUMA node.
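(Re the HugePages sizing above: a quick way to see what each node has
reserved in 1 GiB HugePages vs. what is actually left free -- a sketch
assuming the standard sysfs node layout, adjust if yours differs:)

    for n in /sys/devices/system/node/node*; do
        huge=$(cat "$n"/hugepages/hugepages-1048576kB/nr_hugepages)
        free=$(awk '/MemFree/ {print $4}' "$n"/meminfo)
        echo "$(basename "$n"): ${huge} x 1GiB reserved, ${free} kB free"
    done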
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, 6.6.1 was only released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I tried disabling THP, without any effect.
>
> Gotcha. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)

I can try it. Will let you know.
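P.S. For completeness, these are the usual sysfs knobs we flip when
comparing the two configurations above (paths as on our 6.5.y hosts, no
reboot needed):

    # multi-gen LRU: 'y' enables, 'n' disables;
    # reading it back shows the active feature bitmask (e.g. 0x0007)
    cat /sys/kernel/mm/lru_gen/enabled
    echo n > /sys/kernel/mm/lru_gen/enabled

    # transparent hugepages: 'never' disables
    cat /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/enabled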