From: Igor Raits
Date: Fri, 5 Jan 2024 00:51:10 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Jaroslav Pulchart
Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm@linux-foundation.org, linux-mm@kvack.org, Dave Ertman
References: <7df7e478-bd93-03df-5b10-19308f416e95@quicinc.com>

Hello everyone,

On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable
> > > > > > > to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch to upstream.
> > > > > >
> > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > @Jaroslav: Have you observed something like above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > fixing the kswapd continuous run issue.
> > > > >
> > > > > >
> > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, maybe we
> > > > > > should at least use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > what all I can say for this patch.
> > > > > >
> > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > +               WMARK_PROMO : WMARK_HIGH;
> > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > +               if (managed_zone(zone) &&
> > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > +                       return false;
> > > > > > +       }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > (from 6.6.8) and the server behaves much much worse.
> > > >
> > > > I got multiple kswapd* load to ~100% immediately.
> > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for prompt response. I did a test with the patch and it didn't
> > help. The situation is super strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > worst situation, but the kswapd load is visible from 6.6.8.
> >
> > Setup of this server:
> > * 4 chiplets per each sockets, there are 2 sockets
> > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> >   memory pressure however it is even worse now in contrary.
> >
> > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > NUMA nodes:  0     1     2     3     4     5     6     7
> > HPTotalGiB:  28    28    28    28    28    28    28    28
> > HPFreeGiB:   28    28    28    28    28    28    28    28
> > MemTotal:    32264 32701 32701 32686 32701 32659 32701 32696
> > MemFree:     2766  2715  63    2366  3495  2990  3462  252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > NUMA nodes:  0     1     2     3     4     5     6     7
> > HPTotalGiB:  28    28    28    28    28    28    28    28
> > HPFreeGiB:   28    28    28    28    28    28    28    28
> > MemTotal:    32264 32701 32701 32686 32701 32701 32659 32696
> > MemFree:     2744  2788  65    581   3304  3215  3266  2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > NUMA nodes:  0     1     2     3     4     5     6     7
> > HPTotalGiB:  28    28    28    28    28    28    28    28
> > HPFreeGiB:   28    28    28    28    28    28    28    28
> > MemTotal:    32264 32701 32701 32686 32659 32701 32701 32696
> > MemFree:     75    60    60    60    3169  2784  3203  2944
>
> I run few more combinations, and here are results / findings:
>
>   6.6.7-1 (vanilla)                            == OK, no issue
>
>   6.6.8-1 (vanilla)                            == single kswapd 100% !
>   6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
>   6.6.8-1 (revert four mglru patches)          == OK, no issue
>
>   6.6.9-1 (vanilla)                            == four kswapd 100% !!!!
>   6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
>   6.6.9-3 (revert four mglru patches)          == four kswapd 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
>   kernel 6.6.8,
> * there is (new?) problem in case of 6.6.9 kernel, which looks not to
>   be related to mglru patches at all

I was able to bisect this change and it looks like there is something
going wrong with the ice driver…

Usually after booting our server we see something like this. Most of
the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
that have a really low amount of free memory and we don't know why but
it looks like that in the end causes the constant swap in/out issue.
With the final bit of the patch you've sent earlier in this thread it
is almost invisible.

NUMA nodes:  0     1     2     3     4     5     6     7
HPTotalGiB:  28    28    28    28    28    28    28    28
HPFreeGiB:   28    28    28    28    28    28    28    28
MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
MemFree:     2191  2828  92    292   3344  2916  3594  3222

However, after the following patch we see that more NUMA nodes have
such a low amount of memory, and that is causing constant reclaiming
of memory because it looks like something inside of the kernel ate all
the memory. This is right after the start of the system as well.

NUMA nodes:  0     1     2     3     4     5     6     7
HPTotalGiB:  28    28    28    28    28    28    28    28
HPFreeGiB:   28    28    28    28    28    28    28    28
MemTotal:    32264 32701 32659 32686 32701 32701 32701 32696
MemFree:     46    59    51    33    3078  3535  2708  3511

The difference is 18G vs 12G of free memory summed across all NUMA
nodes right after boot of the system. If you have some hints on how to
debug what is actually occupying all that memory (maybe in both cases),
I would be happy to debug more!

Dave, would you have any idea why that patch could cause such a boost
in memory utilization?

commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman
Date:   Mon Dec 11 13:19:28 2023 -0800

    ice: alter feature support check for SRIOV and LAG

    [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]

    Previously, the ice driver had support for using a handler for bonding
    netdev events to ensure that conflicting features were not allowed to be
    activated at the same time.  While this was still in place, additional
    support was added to specifically support SRIOV and LAG together.  These
    both utilized the netdev event handler, but the SRIOV and LAG feature was
    behind a capabilities feature check to make sure the current NVM has
    support.

    The exclusion part of the event handler should be removed since there are
    users who have custom made solutions that depend on the non-exclusion of
    features.

    Wrap the creation/registration and cleanup of the event handler and
    associated structs in the probe flow with a feature check so that the
    only systems that support the full implementation of LAG features will
    initialize support.  This will leave other systems unhindered with
    functionality as it existed before any LAG code was added.
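
As a cross-check of the "18G vs 12G" figure, here is a minimal sketch that
sums the MemFree rows of the two boot-time tables in this message (before
and after the ice commit) and lists the nodes that are nearly out of free
memory. It assumes the MemFree values are in MiB, which the tables do not
state, and the 512 MiB cutoff (LOW_MIB) is a hypothetical threshold chosen
only for illustration, not a kernel constant.

```python
# MemFree per NUMA node (nodes 0..7), copied from the two tables in this message.
# Assumption: values are in MiB; the tables do not state the unit.
MEMFREE_BEFORE = [2191, 2828, 92, 292, 3344, 2916, 3594, 3222]
MEMFREE_AFTER = [46, 59, 51, 33, 3078, 3535, 2708, 3511]

# Hypothetical cutoff for "really low" free memory; not a kernel constant.
LOW_MIB = 512


def total_gib(memfree_mib):
    """Sum per-node MemFree (MiB) and convert the total to GiB."""
    return sum(memfree_mib) / 1024


def starved_nodes(memfree_mib, low_mib=LOW_MIB):
    """Return the indices of nodes whose MemFree is below low_mib."""
    return [node for node, free in enumerate(memfree_mib) if free < low_mib]


if __name__ == "__main__":
    for label, row in (("before", MEMFREE_BEFORE), ("after", MEMFREE_AFTER)):
        print(f"{label}: {total_gib(row):.1f} GiB free, "
              f"starved nodes: {starved_nodes(row)}")
```

Under those assumptions the totals come to 18479 MiB (about 18.0 GiB) and
13021 MiB (about 12.7 GiB), and the starved nodes go from 2 and 3 to
0, 1, 2 and 3, i.e. the whole first socket.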