From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Date: Thu, 9 Nov 2023 07:39:15 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Yu Zhao
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Igor Raits, Daniel Secik, Charan Teja Kalla

> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart wrote:
> >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao,
> >
> > thanks for the response, see answers inline:
> >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
> > > > system (16 NUMA domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it sooner, as it has been under investigation since
> > 23rd May (6.4.y and maybe even 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.

I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
instead of backporting the fixes to 6.5.y.

> > > > Symptoms of my issue are
> > > >
> > > > /A/ if multi-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under
> > > memory pressure.
> > >
> > > > top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
> > > > Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > ...
> > > >     765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
> > > > ...
> > > > 2/ swap space usage is low, about ~4MB out of 8GB, with swap on zram
> > > > (it was observed with a swap disk as well, and caused IO latency
> > > > issues due to some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical, ~12MB/s in and ~12MB/s out
> > > >
> > > > /B/ if multi-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
> > > > Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
> > > > MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > ...
> > > >     765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
> > > > ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical, ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong, as they are using swap in/out extensively;
> > > > however, the multi-gen LRU situation is 10 times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free), that is true;
> > however, the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/application is running in pre-allocated HugePages and the
> > rest is used for a small set of system services and drivers of
> > devices. It is static and not growing. The issue persists when I stop
> > the system services and free the memory.
>
> Yes, this helps.
> Also could you attach /proc/buddyinfo from the moment
> you hit the problem?

I can. The problem is continuous: kswapd3 is doing swap in/out,
consuming 100% of a CPU and locking up IO, 100% of the time.
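(A note on reading the dump below: each /proc/buddyinfo column is the
count of free blocks of order 0..10, i.e. blocks of 2^order pages. A
rough sketch to turn that into free MiB per zone -- assuming 4 KiB base
pages, which is what these hosts use:)

    awk '/zone/ {
           pages = 0
           # fields 5..15 hold the free-block counts for orders 0..10
           for (i = 5; i <= NF; i++)
               pages += $i * 2 ^ (i - 5)
           printf "%s %s %s %s: %.0f MiB free\n", $1, $2, $3, $4, pages * 4096 / 1048576
         }' /proc/buddyinfo

For node 3's Normal zone below this works out to roughly 300 MiB,
consistent with the ~302MB MemFree quoted further down.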
The output of /proc/buddyinfo is:

# cat /proc/buddyinfo
Node  0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
Node  0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
Node  0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
Node  1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
Node  2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
Node  3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Node  4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
Node  5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
Node  6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
Node  7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
Node  8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
Node  9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58

> > > > Could I ask for any suggestions on how to avoid the kswapd
> > > > utilization pattern?
> > >
> > > The easiest way is to disable the NUMA domains so that there would be
> > > only two nodes with 8x more memory. IOW, you have fewer pools but each
> > > pool has more memory and therefore they are less likely to become empty.
> > >
> > > > There is free RAM in each NUMA node for the few MB used in swap:
> > > > NUMA stats:
> > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > 65486 65486 65486 65486 65486 65486 65424
> > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > 2623 2833 2530 2269
> > > > the in/out usage does not make sense to me, nor does the CPU
> > > > utilization by multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. The memory usage is not growing, nor is the swap space
> > usage; it is still just a few MB.
> >
> > > 2. Was THP enabled?
> >
> > Both situations, with THP enabled and with it disabled.
>
> My suspicion is that you packed node 3 too perfectly :) And that
> might have triggered a known but currently low-priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me in case v6.6 by itself still has the problem?

I would not focus just on node 3; we had issues on different servers
with node 0 and node 2 in parallel, but mostly it is node 3.

Our setup looks like this:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system.

The KVM VMs' vCPUs are pinned to the NUMA domains and use the HugePages
(topology is exposed to the VMs, no overcommit, no shared CPUs), and the
qemu-kvm threads are pinned to the same NUMA domain as their vCPUs.
System services are not pinned. I'm not sure why node 3 is used the
most, as the VMs are balanced and the host's system services can move
between domains.

> > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills
> > > or produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling NUMA is not an option. However, we are now testing a setup
> > with 1GB less in HugePages per NUMA node.
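(Re the HugePages sizing above: a quick way to see what each node has
reserved in 1 GiB HugePages vs. what is actually left free -- a sketch
assuming the standard sysfs node layout, adjust if yours differs:)

    for n in /sys/devices/system/node/node*; do
        huge=$(cat "$n"/hugepages/hugepages-1048576kB/nr_hugepages)
        free=$(awk '/MemFree/ {print $4}' "$n"/meminfo)
        echo "$(basename "$n"): ${huge} x 1GiB reserved, ${free} kB free"
    done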
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, 6.6.1 was only released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I tried disabling THP, without any effect.
>
> Gotcha. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)

I can try it. Will let you know.
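P.S. For completeness, these are the usual sysfs knobs we flip when
comparing the two configurations above (paths as on our 6.5.y hosts, no
reboot needed):

    # multi-gen LRU: 'y' enables, 'n' disables;
    # reading it back shows the active feature bitmask (e.g. 0x0007)
    cat /sys/kernel/mm/lru_gen/enabled
    echo n > /sys/kernel/mm/lru_gen/enabled

    # transparent hugepages: 'never' disables
    cat /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/enabled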