From: Jaroslav Pulchart
Date: Mon, 8 Jan 2024 18:53:48 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: "Ertman, David M", Yu Zhao
Cc: Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, "akpm@linux-foundation.org", "linux-mm@kvack.org"
References: <7df7e478-bd93-03df-5b10-19308f416e95@quicinc.com>

>
> > -----Original Message-----
> > From: Igor Raits
> > Sent: Thursday, January 4, 2024 3:51 PM
> > To: Jaroslav Pulchart
> > Cc: Yu Zhao; Daniel Secik; Charan Teja Kalla;
> > Kalesh Singh; akpm@linux-foundation.org; linux-mm@kvack.org;
> > Ertman, David M
> > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > with multi-gen LRU
> >
> > Hello everyone,
> >
> > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart wrote:
> > >
> > > >
> > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart wrote:
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Hi yu,
> > > > > > > >
> > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > > > additional feedback? Thanks.
> > > > > > > >
> > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > >
> > > > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > >
> > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > fixing the kswapd continuous run issue.
> > > > > > >
> > > > > > > >
> > > > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > >
> > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > > > what all I can say for this patch.
> > > > > > > >
> > > > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > +               WMARK_PROMO : WMARK_HIGH;
> > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > +
> > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > +                       return false;
> > > > > > > > +       }
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jaroslav Pulchart
> > > > > > > Sr. Principal SW Engineer
> > > > > > > GoodData
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > >
> > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > > > are the changes in upstream different compared to the initial patch
> > > > > > which I tested?
> > > > > >
> > > > > > Best regards,
> > > > > > Jaroslav Pulchart
> > > > >
> > > > > Hi Jaroslav,
> > > > >
> > > > > My apologies for all the trouble!
> > > > >
> > > > > Yes, there is a slight difference between the fix you verified and
> > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > condition which I thought wouldn't affect you.
> > > > >
> > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > special condition.
> > > > >
> > > > > Thanks!
> > > >
> > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > help. The situation is super strange.
> > > >
> > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > >
> > > > Setup of this server:
> > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > >   memory pressure however it is even worse now in contrary.
> > > >
> > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree:      2766  2715    63  2366  3495  2990  3462   252
> > > >
> > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree:      2744  2788    65   581  3304  3215  3266  2226
> > > >
> > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes:      0     1     2     3     4     5     6     7
> > > > HPTotalGiB:     28    28    28    28    28    28    28    28
> > > > HPFreeGiB:      28    28    28    28    28    28    28    28
> > > > MemTotal:    32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree:        75    60    60    60  3169  2784  3203  2944
> > > >
> > > I run few more combinations, and here are results / findings:
> > >
> > >   6.6.7-1 (vanila)                             == OK, no issue
> > >
> > >   6.6.8-1 (vanila)                             == single kswapd 100% !
> > >   6.6.8-1 (vanila plus mglru-fix-6.6.9.patch)  == OK, no issue
> > >   6.6.8-1 (revert four mglru patches)          == OK, no issue
> > >
> > >   6.6.9-1 (vanila)                             == four kswapd 100% !!!!
> > >   6.6.9-2 (vanila plus mglru-fix-6.6.9.patch)  == four kswapd 100% !!!!
> > >   6.6.9-3 (revert four mglru patches)          == four kswapd 100% !!!!
> > >
> > > Summary:
> > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > >   kernel 6.6.8,
> > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > >   be related to mglru patches at all
> >
> > I was able to bisect this change and it looks like there is something
> > going wrong with the ice driver…
> >
> > Usually after booting our server we see something like this. Most of
> > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > that have a really low amount of free memory and we don't know why but
> > it looks like that in the end causes the constant swap in/out issue.
> > With the final bit of the patch you've sent earlier in this thread it
> > is almost invisible.
> >
> > NUMA nodes:      0     1     2     3     4     5     6     7
> > HPTotalGiB:     28    28    28    28    28    28    28    28
> > HPFreeGiB:      28    28    28    28    28    28    28    28
> > MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree:      2191  2828    92   292  3344  2916  3594  3222
> >
> >
> > However, after the following patch we see that more NUMA nodes have
> > such a low amount of memory and that is causing constant reclaiming
> > of memory because it looks like something inside of the kernel ate all
> > the memory. This is right after the start of the system as well.
> >
> > NUMA nodes:      0     1     2     3     4     5     6     7
> > HPTotalGiB:     28    28    28    28    28    28    28    28
> > HPFreeGiB:      28    28    28    28    28    28    28    28
> > MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree:        46    59    51    33  3078  3535  2708  3511
> >
> > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > nodes right after boot of the system. If you have some hints on how to
> > debug what is actually occupying all that memory, maybe in both cases
> > - would be happy to debug more!
> >
> > Dave, would you have any idea why that patch could cause such a boost
> > in memory utilization?
> >
> > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > Author: Dave Ertman
> > Date:   Mon Dec 11 13:19:28 2023 -0800
> >
> >     ice: alter feature support check for SRIOV and LAG
> >
> >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> >
> >     Previously, the ice driver had support for using a handler for bonding
> >     netdev events to ensure that conflicting features were not allowed to be
> >     activated at the same time.  While this was still in place, additional
> >     support was added to specifically support SRIOV and LAG together.  These
> >     both utilized the netdev event handler, but the SRIOV and LAG feature was
> >     behind a capabilities feature check to make sure the current NVM has
> >     support.
> >
> >     The exclusion part of the event handler should be removed since there are
> >     users who have custom made solutions that depend on the non-exclusion of
> >     features.
> >
> >     Wrap the creation/registration and cleanup of the event handler and
> >     associated structs in the probe flow with a feature check so that the
> >     only systems that support the full implementation of LAG features will
> >     initialize support.  This will leave other systems unhindered with
> >     functionality as it existed before any LAG code was added.
>
> Igor,
>
> I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> for the pf->lag struct.
>
> DaveE

Hello,

I believe we can track it as two different issues. So I reported the
ICE driver commit as an email with subject "[REGRESSION] Intel ICE
Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
and cause continous kswapd* usage and continuous swapping" to
  Jesse Brandeburg
  Tony Nguyen
  intel-wired-lan@lists.osuosl.org
  Dave Ertman

Let's track the mglru here in this email thread.

Yu, the kernel build with your mglru-fix-6.6.9.patch seems to be OK,
at least after running it for 3 days without kswapd usage (excluding
the ice driver commit).

Best!
-- 
Jaroslav Pulchart
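
For anyone who wants to collect per-node figures like the tables quoted in
this thread, here is a minimal sketch. It assumes the standard sysfs files
/sys/devices/system/node/nodeN/meminfo. The function names are hypothetical,
the conversion to MiB is an assumption about the units used in the tables,
and the hugepage rows are raw page counts (turning them into GiB depends on
the configured hugepage size).

```python
import re
from pathlib import Path

# Matches lines such as "Node 0 MemFree:   2832384 kB"
# or "Node 0 HugePages_Total:    28" (hugepage counters carry no unit).
_LINE = re.compile(r"^Node\s+(\d+)\s+([^:]+):\s+(\d+)")


def parse_node_meminfo(text):
    """Return {field: int} for one node's meminfo text; kB fields stay in kB."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            fields[m.group(2)] = int(m.group(3))
    return fields


def summarize(nodes):
    """Build table rows from {node_id: fields}; memory rows are in MiB."""
    ids = sorted(nodes)
    return {
        "NUMA nodes": ids,
        "HPTotal": [nodes[i].get("HugePages_Total", 0) for i in ids],
        "HPFree": [nodes[i].get("HugePages_Free", 0) for i in ids],
        "MemTotal": [nodes[i].get("MemTotal", 0) // 1024 for i in ids],
        "MemFree": [nodes[i].get("MemFree", 0) // 1024 for i in ids],
    }


def read_sysfs(root="/sys/devices/system/node"):
    """Read every nodeN/meminfo file under root (Linux only)."""
    nodes = {}
    for path in Path(root).glob("node[0-9]*/meminfo"):
        node_id = int(path.parent.name[len("node"):])
        nodes[node_id] = parse_node_meminfo(path.read_text())
    return nodes


if __name__ == "__main__":
    rows = summarize(read_sysfs())
    for name, values in rows.items():
        print(f"{name + ':':<12}" + "".join(f"{v:>7}" for v in values))
    print("MemFree sum (MiB):", sum(rows["MemFree"]))
```

Running it right after boot on two kernels and comparing the MemFree sum
gives the same kind of comparison as the "18G vs 12G" figure above. It does
not show what is holding the memory, only where it is missing.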