From: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
To: "Ertman, David M" <david.m.ertman@intel.com>,
Yu Zhao <yuzhao@google.com>
Cc: Igor Raits <igor@gooddata.com>,
Daniel Secik <daniel.secik@gooddata.com>,
Charan Teja Kalla <quic_charante@quicinc.com>,
Kalesh Singh <kaleshsingh@google.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
Date: Mon, 8 Jan 2024 18:53:48 +0100
Message-ID: <CAK8fFZ6CRk_hXtkcE-5CEqTC5md_-B=6PdbzoMqLSFECWj6sWg@mail.gmail.com>
In-Reply-To: <MW5PR11MB5811948E3445D78F081023ECDD662@MW5PR11MB5811.namprd11.prod.outlook.com>
>
> > -----Original Message-----
> > From: Igor Raits <igor@gooddata.com>
> > Sent: Thursday, January 4, 2024 3:51 PM
> > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > <david.m.ertman@intel.com>
> > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > with multi-gen LRU
> >
> > Hello everyone,
> >
> > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Hi yu,
> > > > > > > >
> > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > > > > additional feedback? Thanks.
> > > > > > > > >
> > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > >
> > > > > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > >
> > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > > >
> > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > > > > what all I can say for this patch.
> > > > > > > > >
> > > > > > > > > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > +        WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > +     struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > > +     unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > +
> > > > > > > > > +     if (managed_zone(zone) &&
> > > > > > > > > +         !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > > +         return false;
> > > > > > > > > + }
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jaroslav Pulchart
> > > > > > > Sr. Principal SW Engineer
> > > > > > > GoodData
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > Today we tried to update servers to 6.6.9, which contains the mglru
> > > > > > > fixes (from 6.6.8), and the server behaves much, much worse.
> > > > > > >
> > > > > > > Multiple kswapd* threads immediately went to ~100% CPU:
> > > > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86 kswapd1
> > > > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76 kswapd0
> > > > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27 kswapd2
> > > > > > >
> > > > > > > Are the changes in upstream different from the initial patch
> > > > > > > which I tested?
> > > > > >
> > > > > > Best regards,
> > > > > > Jaroslav Pulchart
> > > > >
> > > > > Hi Jaroslav,
> > > > >
> > > > > My apologies for all the trouble!
> > > > >
> > > > > Yes, there is a slight difference between the fix you verified and
> > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > condition which I thought wouldn't affect you.
> > > > >
> > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > special condition.
> > > > >
> > > > > Thanks!
> > > >
> > > > > Thanks for the prompt response. I tested with the patch and it didn't
> > > > > help. The situation is very strange.
> > > > >
> > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
> > > > > utilization on all NUMA nodes of the first CPU socket, which is the
> > > > > worst case, but the kswapd load is already visible from 6.6.8.
> > > > >
> > > > > Setup of this server:
> > > > > * 2 sockets, 4 chiplets per socket
> > > > > * 32 GB of RAM per chiplet, of which 28 GB are in hugepages
> > > > > Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
> > > > > memory pressure, but on the contrary it is even worse now.
> > > >
> > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > >
> > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > >
> > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > >
> > > > I ran a few more combinations; here are the results:
> > > >
> > > > 6.6.7-1 (vanilla) == OK, no issue
> > > >
> > > > 6.6.8-1 (vanilla) == single kswapd 100% !
> > > > 6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > > >
> > > > 6.6.9-1 (vanilla) == four kswapd 100% !!!!
> > > > 6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > > >
> > > > Summary:
> > > > * mglru-fix-6.6.9.patch or reverting the mglru patches helps on
> > > > kernel 6.6.8,
> > > > * there is a (new?) problem on the 6.6.9 kernel, which does not appear
> > > > to be related to the mglru patches at all
> >
> > I was able to bisect this change and it looks like there is something
> > going wrong with the ice driver…
> >
> > Usually after booting our server we see something like this. Most of
> > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > that have a really low amount of free memory and we don't know why but
> > it looks like that in the end causes the constant swap in/out issue.
> > With the final bit of the patch you've sent earlier in this thread it
> > is almost invisible.
> >
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree: 2191 2828 92 292 3344 2916 3594 3222
> >
> >
> > However, after the following patch we see that more NUMA nodes have
> > such a low amount of memory and that is causing constant reclaiming
> > of memory because it looks like something inside of the kernel ate all
> > the memory. This is right after the start of the system as well.
> >
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree: 46 59 51 33 3078 3535 2708 3511
> >
> > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > nodes right after boot of the system. If you have some hints on how to
> > debug what is actually occupying all that memory, maybe in both cases
> > - would be happy to debug more!
> >
> > Dave, would you have any idea why that patch could cause such a boost
> > in memory utilization?
> >
> > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > Author: Dave Ertman <david.m.ertman@intel.com>
> > Date: Mon Dec 11 13:19:28 2023 -0800
> >
> > ice: alter feature support check for SRIOV and LAG
> >
> > [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> >
> > Previously, the ice driver had support for using a handler for bonding
> > netdev events to ensure that conflicting features were not allowed to be
> > activated at the same time. While this was still in place, additional
> > support was added to specifically support SRIOV and LAG together. These
> > both utilized the netdev event handler, but the SRIOV and LAG feature was
> > behind a capabilities feature check to make sure the current NVM has
> > support.
> >
> > The exclusion part of the event handler should be removed since there are
> > users who have custom made solutions that depend on the non-exclusion of
> > features.
> >
> > Wrap the creation/registration and cleanup of the event handler and
> > associated structs in the probe flow with a feature check so that the
> > only systems that support the full implementation of LAG features will
> > initialize support. This will leave other systems unhindered with
> > functionality as it existed before any LAG code was added.
>
> Igor,
>
> I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> for the pf->lag struct.
>
> DaveE
Hello,

I believe we can track this as two different issues, so I reported the
ice driver commit in a separate email with the subject "[REGRESSION] Intel ICE
Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
and cause continous kswapd* usage and continuous swapping" to:

Jesse Brandeburg <jesse.brandeburg@intel.com>
Tony Nguyen <anthony.l.nguyen@intel.com>
intel-wired-lan@lists.osuosl.org
Dave Ertman <david.m.ertman@intel.com>

Let's track the mglru issue here in this email thread. Yu, the kernel built
with your mglru-fix-6.6.9.patch seems to be OK: it has been running for
3 days without kswapd usage (with the ice driver commit excluded).
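For anyone who wants to compare per-node free memory between kernels, here is
a minimal sketch of how figures like the MemFree rows quoted above could be
collected. It is not the script that produced those tables. It assumes the
per-node sysfs files /sys/devices/system/node/node*/meminfo (values in kB)
and assumes the quoted tables are in MiB; the function names are hypothetical.

```python
import glob
import re

# Matches per-node lines such as "Node 0 MemFree:   2832384 kB".
# HugePages_* lines carry a plain count with no "kB" suffix.
_LINE = re.compile(r"^Node\s+\d+\s+(\S+):\s+(\d+)(?:\s+kB)?$")


def parse_node_meminfo(text):
    """Return {field: value} for one node's meminfo text, values as printed."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def free_mib_per_node(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {node_id: MemFree in MiB} for every node matching the pattern."""
    result = {}
    for path in glob.glob(pattern):
        node = int(re.search(r"node(\d+)/meminfo$", path).group(1))
        with open(path) as fh:
            fields = parse_node_meminfo(fh.read())
        # Assumption: MemFree is reported in kB; convert to MiB.
        result[node] = fields.get("MemFree", 0) // 1024
    return result


if __name__ == "__main__":
    per_node = free_mib_per_node()
    for node, mib in sorted(per_node.items()):
        print(f"node{node}: {mib} MiB free")
    print(f"total: {sum(per_node.values())} MiB free")
```

Running it right after boot on each kernel and comparing the totals should
show whether the extra consumption is already there before the application
starts.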
Best!
--
Jaroslav Pulchart