From: Yu Zhao
Date: Wed, 8 Nov 2023 22:48:29 -0800
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Jaroslav Pulchart
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Igor Raits, Daniel Secik, Charan Teja Kalla

On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart wrote:
>
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart wrote:
> > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for response, see answers inline:
> > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > > system (16 NUMA domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it sooner, as it has been under investigation since
> > > the 23rd of May (6.4.y and maybe even 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > for you if you run into other problems with v6.6.
>
> I will give it a try using 6.6.y. If it works we can switch to 6.6.y
> instead of backporting the stuff to 6.5.y.
>
> > > > > Symptoms of my issue are:
> > > > >
> > > > > /A/ if multi-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under
> > > > memory pressure.
> > > >
> > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
> > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 R 98.3 0.0 34969:04 kswapd3
> > > > > ...
> > > > > 2/ swap space usage is low, about ~4MB out of 8GB of swap on zram (was
> > > > > observed with a swap disk as well, and caused IO latency issues due to
> > > > > some kind of locking)
> > > > > 3/ swap In/Out is huge and symmetrical, ~12MB/s in and ~12MB/s out
> > > > >
> > > > > /B/ if multi-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
> > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 S 3.6 0.0 34966:46 [kswapd3]
> > > > > ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap In/Out is huge and symmetrical, ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong as they are using swap in/out extensively;
> > > > > however, the multi-gen LRU situation is 10 times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. This node
> > > has the lowest free memory of all nodes (~302MB free), that is true;
> > > however, the swap space usage is just 4MB (still going in and out). So
> > > what can be the reason for that behaviour?
> >
> > The best analogy is that refuel (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high order allocations).
> >
> > > The workers/application is running in pre-allocated HugePages and the
> > > rest is used for a small set of system services and drivers of
> > > devices. It is static and not growing. The issue persists when I stop
> > > the system services and free the memory.
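Side note, in case it helps to watch the "refuel" point above on a live
system: below is a minimal, untested sketch that walks /proc/zoneinfo and
prints each zone's free pages next to its min/low/high watermarks, which
is what kswapd compares against when it decides whether to keep
reclaiming. It assumes the usual zoneinfo key names ("pages free", "min",
"low", "high") and is only meant as a debugging aid, not as anything
authoritative. Running it once a second next to the top output above
would show whether node 3's Normal zone keeps dipping below its low
watermark.

/* zonewm.c - print per-zone free pages vs. watermarks from /proc/zoneinfo.
 * Untested sketch; assumes the usual "pages free"/"min"/"low"/"high" layout.
 * Build: gcc -O2 -o zonewm zonewm.c
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/zoneinfo", "r");
        char line[256], zone[32] = "";
        int node = -1;
        unsigned long nr_free = 0, min = 0, low = 0, high = 0;

        if (!f) {
                perror("/proc/zoneinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2)
                        continue;
                if (sscanf(line, " pages free %lu", &nr_free) == 1)
                        continue;
                if (sscanf(line, " min %lu", &min) == 1)
                        continue;
                if (sscanf(line, " low %lu", &low) == 1)
                        continue;
                /* "high" is the last watermark line in a zone block */
                if (sscanf(line, " high %lu", &high) == 1)
                        printf("node %2d %-8s free %8lu  min %8lu  low %8lu  high %8lu%s\n",
                               node, zone, nr_free, min, low, high,
                               nr_free < low ? "  <- below low watermark" : "");
        }
        fclose(f);
        return 0;
}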
> >
> > Yes, this helps. Also could you attach /proc/buddyinfo from the moment
> > you hit the problem?
>
> I can. The problem is continuous; it is doing in/out 100% of the time,
> consuming 100% of a CPU and locking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10

Again, thinking out loud: there is only one zone on node 3, i.e., the
Normal zone, and this excludes the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").

> Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
> Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
> Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
> Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
> Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
> Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
> Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
> Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
> Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
> Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
> Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
> Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd
> > > > > utilization pattern?
> > > >
> > > > The easiest way is to disable the NUMA domains so that there would be
> > > > only two nodes with 8x more memory. IOW, you have fewer pools, but each
> > > > pool has more memory and therefore they are less likely to become empty.
> > > >
> > > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > > > NUMA stats:
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > 2623 2833 2530 2269
> > > > > The in/out usage does not make sense to me, nor does the CPU
> > > > > utilization by multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills with either case?
> > >
> > > There is no OOM. The memory usage is not growing, nor is the swap space
> > > usage; it is still a few MB there.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations, with enabled and with disabled THP.
> >
> > My suspicion is that you packed node 3 too perfectly :) And that
> > might have triggered a known but currently low-priority problem in
> > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > for me in case v6.6 by itself still has the problem?
>
> I would not focus just on node 3; we had issues on different servers
> with node 0 and node 2 both in parallel, but mostly it is node 3.
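The snapshot above is easier to judge as sizes than as raw per-order
counts, so here is a rough, untested sketch that converts each
/proc/buddyinfo line into MiB and also shows how much of the free memory
sits in order >= 9 blocks, i.e. the high-order blocks that the "long road
ahead" remark earlier refers to. It assumes 4 KiB base pages and the
usual 11 orders; treat it purely as a convenience for reading snapshots
like this one. On the numbers above it should report roughly 300 MiB free
in total for node 3, consistent with the ~302MB MemFree below.

/* buddy2mib.c - summarize /proc/buddyinfo as MiB per zone.
 * Untested sketch; assumes 4 KiB base pages and up to 11 order columns.
 * Build: gcc -O2 -o buddy2mib buddy2mib.c
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDERS 11           /* orders 0..10 on most configs */
#define PAGE_KIB   4            /* assumption: 4 KiB base page size */

int main(void)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                int node = -1, off = 0, order;
                char zone[32];
                double total_mib = 0, high_mib = 0;

                if (sscanf(line, "Node %d, zone %31s %n", &node, zone, &off) < 2)
                        continue;

                char *p = line + off;
                for (order = 0; order < MAX_ORDERS; order++) {
                        char *end;
                        unsigned long count = strtoul(p, &end, 10);
                        if (end == p)
                                break;
                        double mib = count * (double)(PAGE_KIB << order) / 1024.0;
                        total_mib += mib;
                        if (order >= 9) /* >= 2 MiB blocks with 4 KiB pages */
                                high_mib += mib;
                        p = end;
                }

                printf("node %2d %-8s free %8.1f MiB, of which %7.1f MiB in order>=9 blocks\n",
                       node, zone, total_mib, high_mib);
        }
        fclose(f);
        return 0;
}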
>
> Here is how our setup looks:
> * each node has 64GB of RAM,
> * 61GB of it is in 1GB Huge Pages,
> * the remaining 3GB is used by the host system
>
> There are KVM VMs running with vCPUs pinned to the NUMA domains and
> using the Huge Pages (topology is exposed to the VMs, no overcommit, no
> shared CPUs), and the qemu-kvm threads are pinned to the same NUMA
> domain as the vCPUs. System services are not pinned. I'm not sure why
> node 3 is used the most, as the VMs are balanced and the host's system
> services can move between domains.
>
> > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills
> > > > or produce more THPs.
> > > >
> > > > If disabling the NUMA domains isn't an option, I'd recommend:
> > >
> > > Disabling NUMA is not an option. However, we are now testing a setup
> > > with 1GB less in HugePages on each NUMA node.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet, 6.6.1 was released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I tried disabling THP without any effect.
> >
> > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CC Charan @ Qualcomm, who initially reported the problem that
> > ended up with the attached patch.)
>
> I can try it. Will let you know.

Great, thanks!
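One last convenience, since the A/B comparisons above (MGLRU on/off, THP
on/off, with and without the patch) only make sense next to the exact
settings they ran under: a small, untested sketch that dumps the
documented MGLRU and THP sysfs knobs, so each kswapd measurement can be
labeled with the configuration that was in effect at the time. It only
reads the entries; toggling them (e.g. echoing y/n into lru_gen/enabled)
is left to the usual shell one-liners.

/* mmknobs.c - dump the MGLRU and THP knobs relevant to the A/B tests above.
 * Untested sketch; only reads the documented sysfs entries.
 * Build: gcc -O2 -o mmknobs mmknobs.c
 */
#include <stdio.h>
#include <string.h>

static void show(const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                printf("%-50s <not available>\n", path);
                return;
        }
        if (fgets(buf, sizeof(buf), f)) {
                buf[strcspn(buf, "\n")] = '\0';
                printf("%-50s %s\n", path, buf);
        }
        fclose(f);
}

int main(void)
{
        /* MGLRU: "enabled" is a feature bitmask, min_ttl_ms is the thrashing-prevention knob */
        show("/sys/kernel/mm/lru_gen/enabled");
        show("/sys/kernel/mm/lru_gen/min_ttl_ms");
        /* THP: whether THP allocation and defrag may add reclaim/compaction pressure */
        show("/sys/kernel/mm/transparent_hugepage/enabled");
        show("/sys/kernel/mm/transparent_hugepage/defrag");
        return 0;
}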