From: Yu Zhao
Date: Wed, 8 Nov 2023 22:48:29 -0800
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Jaroslav Pulchart
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Igor Raits, Daniel Secik, Charan Teja Kalla

On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart wrote:
>
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart wrote:
> > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for response, see answers inline:
> > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > > system (16 NUMA domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it sooner, as it has been under investigation since
> > > the 23rd of May (6.4.y and maybe even 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > for you if you run into other problems with v6.6.
>
> I will give it a try using 6.6.y. If it works we can switch to 6.6.y
> instead of backporting the stuff to 6.5.y.
>
> > > > > Symptoms of my issue are:
> > > > >
> > > > > /A/ if multi-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under
> > > > memory pressure.
> > > >
> > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 R 98.3 0.0 34969:04 kswapd3
> > > > > ...
> > > > > 2/ swap space usage is low, about ~4MB out of 8GB of swap on zram (was
> > > > > observed with a swap disk as well, and caused IO latency issues due to
> > > > > some kind of locking)
> > > > > 3/ swap In/Out is huge and symmetrical, ~12MB/s in and ~12MB/s out
> > > > >
> > > > > /B/ if multi-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 S 3.6 0.0 34966:46 [kswapd3]
> > > > > ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap In/Out is huge and symmetrical, ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong as they are using swap in/out extensively;
> > > > > however, the multi-gen LRU situation is 10 times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. This node
> > > has the lowest free memory of all nodes (~302MB free), that is true;
> > > however, the swap space usage is just 4MB (still going in and out). So
> > > what can be the reason for that behaviour?
> >
> > The best analogy is that refuel (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high order allocations).
> >
> > > The workers/application is running in pre-allocated HugePages and the
> > > rest is used for a small set of system services and drivers of
> > > devices. It is static and not growing. The issue persists when I stop
> > > the system services and free the memory.
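Side note, in case it helps to watch the "refuel" point above on a live
system: below is a minimal, untested sketch that walks /proc/zoneinfo and
prints each zone's free pages next to its min/low/high watermarks, which
is what kswapd compares against when it decides whether to keep
reclaiming. It assumes the usual zoneinfo key names ("pages free", "min",
"low", "high") and is only meant as a debugging aid, not as anything
authoritative. Running it once a second next to the top output above
would show whether node 3's Normal zone keeps dipping below its low
watermark.

/* zonewm.c - print per-zone free pages vs. watermarks from /proc/zoneinfo.
 * Untested sketch; assumes the usual "pages free"/"min"/"low"/"high" layout.
 * Build: gcc -O2 -o zonewm zonewm.c
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/zoneinfo", "r");
        char line[256], zone[32] = "";
        int node = -1;
        unsigned long nr_free = 0, min = 0, low = 0, high = 0;

        if (!f) {
                perror("/proc/zoneinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2)
                        continue;
                if (sscanf(line, " pages free %lu", &nr_free) == 1)
                        continue;
                if (sscanf(line, " min %lu", &min) == 1)
                        continue;
                if (sscanf(line, " low %lu", &low) == 1)
                        continue;
                /* "high" is the last watermark line in a zone block */
                if (sscanf(line, " high %lu", &high) == 1)
                        printf("node %2d %-8s free %8lu  min %8lu  low %8lu  high %8lu%s\n",
                               node, zone, nr_free, min, low, high,
                               nr_free < low ? "  <- below low watermark" : "");
        }
        fclose(f);
        return 0;
}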
> >
> > Yes, this helps. Also could you attach /proc/buddyinfo from the moment
> > you hit the problem?
>
> I can. The problem is continuous; it is doing in/out 100% of the time,
> consuming 100% of a CPU and locking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10

Again, thinking out loud: there is only one zone on node 3, i.e., the
Normal zone, and this excludes the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").

> Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
> Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
> Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
> Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
> Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
> Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
> Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
> Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
> Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
> Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
> Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
> Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd
> > > > > utilization pattern?
> > > >
> > > > The easiest way is to disable the NUMA domains so that there would be
> > > > only two nodes with 8x more memory. IOW, you have fewer pools, but each
> > > > pool has more memory and therefore they are less likely to become empty.
> > > >
> > > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > > > NUMA stats:
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > 2623 2833 2530 2269
> > > > > The in/out usage does not make sense to me, nor does the CPU
> > > > > utilization by multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills with either case?
> > >
> > > There is no OOM. The memory usage is not growing, nor is the swap space
> > > usage; it is still a few MB there.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations, with enabled and with disabled THP.
> >
> > My suspicion is that you packed node 3 too perfectly :) And that
> > might have triggered a known but currently low-priority problem in
> > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > for me in case v6.6 by itself still has the problem?
>
> I would not focus just on node 3; we had issues on different servers
> with node 0 and node 2 both in parallel, but mostly it is node 3.
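The snapshot above is easier to judge as sizes than as raw per-order
counts, so here is a rough, untested sketch that converts each
/proc/buddyinfo line into MiB and also shows how much of the free memory
sits in order >= 9 blocks, i.e. the high-order blocks that the "long road
ahead" remark earlier refers to. It assumes 4 KiB base pages and the
usual 11 orders; treat it purely as a convenience for reading snapshots
like this one. On the numbers above it should report roughly 300 MiB free
in total for node 3, consistent with the ~302MB MemFree below.

/* buddy2mib.c - summarize /proc/buddyinfo as MiB per zone.
 * Untested sketch; assumes 4 KiB base pages and up to 11 order columns.
 * Build: gcc -O2 -o buddy2mib buddy2mib.c
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDERS 11           /* orders 0..10 on most configs */
#define PAGE_KIB   4            /* assumption: 4 KiB base page size */

int main(void)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                int node = -1, off = 0, order;
                char zone[32];
                double total_mib = 0, high_mib = 0;

                if (sscanf(line, "Node %d, zone %31s %n", &node, zone, &off) < 2)
                        continue;

                char *p = line + off;
                for (order = 0; order < MAX_ORDERS; order++) {
                        char *end;
                        unsigned long count = strtoul(p, &end, 10);
                        if (end == p)
                                break;
                        double mib = count * (double)(PAGE_KIB << order) / 1024.0;
                        total_mib += mib;
                        if (order >= 9) /* >= 2 MiB blocks with 4 KiB pages */
                                high_mib += mib;
                        p = end;
                }

                printf("node %2d %-8s free %8.1f MiB, of which %7.1f MiB in order>=9 blocks\n",
                       node, zone, total_mib, high_mib);
        }
        fclose(f);
        return 0;
}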
>
> Here is how our setup looks:
> * each node has 64GB of RAM,
> * 61GB of it is in 1GB Huge Pages,
> * the remaining 3GB is used by the host system
>
> There are KVM VMs running with vCPUs pinned to the NUMA domains and
> using the Huge Pages (topology is exposed to the VMs, no overcommit, no
> shared CPUs), and the qemu-kvm threads are pinned to the same NUMA
> domain as the vCPUs. System services are not pinned. I'm not sure why
> node 3 is used the most, as the VMs are balanced and the host's system
> services can move between domains.
>
> > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills
> > > > or produce more THPs.
> > > >
> > > > If disabling the NUMA domains isn't an option, I'd recommend:
> > >
> > > Disabling NUMA is not an option. However, we are now testing a setup
> > > with 1GB less in HugePages on each NUMA node.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet, 6.6.1 was released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I tried disabling THP without any effect.
> >
> > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CC Charan @ Qualcomm, who initially reported the problem that
> > ended up with the attached patch.)
>
> I can try it. Will let you know.

Great, thanks!
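One last convenience, since the A/B comparisons above (MGLRU on/off, THP
on/off, with and without the patch) only make sense next to the exact
settings they ran under: a small, untested sketch that dumps the
documented MGLRU and THP sysfs knobs, so each kswapd measurement can be
labeled with the configuration that was in effect at the time. It only
reads the entries; toggling them (e.g. echoing y/n into lru_gen/enabled)
is left to the usual shell one-liners.

/* mmknobs.c - dump the MGLRU and THP knobs relevant to the A/B tests above.
 * Untested sketch; only reads the documented sysfs entries.
 * Build: gcc -O2 -o mmknobs mmknobs.c
 */
#include <stdio.h>
#include <string.h>

static void show(const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                printf("%-50s <not available>\n", path);
                return;
        }
        if (fgets(buf, sizeof(buf), f)) {
                buf[strcspn(buf, "\n")] = '\0';
                printf("%-50s %s\n", path, buf);
        }
        fclose(f);
}

int main(void)
{
        /* MGLRU: "enabled" is a feature bitmask, min_ttl_ms is the thrashing-prevention knob */
        show("/sys/kernel/mm/lru_gen/enabled");
        show("/sys/kernel/mm/lru_gen/min_ttl_ms");
        /* THP: whether THP allocation and defrag may add reclaim/compaction pressure */
        show("/sys/kernel/mm/transparent_hugepage/enabled");
        show("/sys/kernel/mm/transparent_hugepage/defrag");
        return 0;
}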