From: Jaroslav Pulchart
Date: Tue, 16 Jan 2024 18:34:49 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Yu Zhao
Cc: "Ertman, David M", Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm@linux-foundation.org, linux-mm@kvack.org

> > On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart wrote:
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart
> > > > Cc: Yu Zhao; Daniel Secik; Charan Teja Kalla; Kalesh Singh; akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart wrote:
> > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart wrote:
> > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment on the code snippet: checking just the 'high wmark' pages might succeed here but can fail in the immediate kswapd sleep, see prepare_kswapd_sleep(). This can show up as an increased KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like the above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time; on the contrary, it is fixing the kswapd continuous run issue.
> > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like this for zone_watermark_ok():
> > > > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > It is hard to be convinced of this 'MIN_LRU_BATCH << 2' empirical value; maybe we should at least use 'MIN_LRU_BATCH' with the mentioned reasoning, is all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > > > +                       return false;
> > > > > > > > > > +       }
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we tried to update the servers to 6.6.9, which contains the mglru fixes (from 6.6.8), and the server behaves much, much worse.
> > > > > > > >
> > > > > > > > I immediately got multiple kswapd* threads at ~100% load:
> > > > > > > >   555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > > > > > >   554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > > > > > >   556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > > > > >
> > > > > > > > Are the changes in upstream different compared to the initial patch which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and what went into 6.6.9. The fix in 6.6.9 is disabled under a special condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for the prompt response. I did a test with the patch and it didn't help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization of all NUMA nodes of the first CPU socket when using 6.6.9, which is the worst situation, but the kswapd load is visible from 6.6.8 on.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 4 chiplets per socket, there are 2 sockets
> > > > > > * 32 GB of RAM for each chiplet, 28 GB of which are in hugepages
> > > > > >   Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid memory pressure, however it is even worse now, on the contrary.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when the application started == OK
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree:      2766  2715  63    2366  3495  2990  3462  252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when the application started
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree:      2744  2788  65    581   3304  3215  3266  2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when the application started
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree:      75    60    60    60    3169  2784  3203  2944
> > > > >
> > > > > I ran a few more combinations, and here are the results / findings:
> > > > >
> > > > > 6.6.7-1 (vanilla)                            == OK, no issue
> > > > >
> > > > > 6.6.8-1 (vanilla)                            == single kswapd 100% !
> > > > > 6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > > 6.6.8-1 (revert four mglru patches)          == OK, no issue
> > > > >
> > > > > 6.6.9-1 (vanilla)                            == four kswapd 100% !!!!
> > > > > 6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > > 6.6.9-3 (revert four mglru patches)          == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting the mglru patches helps in the case of kernel 6.6.8,
> > > > > * there is a (new?) problem in the case of the 6.6.9 kernel, which looks not to be related to the mglru patches at all
> > > >
> > > > I was able to bisect this change and it looks like there is something going wrong with the ice driver…
> > > >
> > > > Usually after booting our server we see something like this. Most of the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes that have a really low amount of free memory, and we don't know why, but it looks like that in the end causes the constant swap in/out issue. With the final bit of the patch you've sent earlier in this thread it is almost invisible.
> > > >
> > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > MemTotal:     32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree:      2191  2828  92    292   3344  2916  3594  3222
> > > >
> > > > However, after the following patch we see that more NUMA nodes have such a low amount of memory, and that is causing constant reclaiming of memory because it looks like something inside the kernel ate all the memory. This is right after the start of the system as well.
> > > >
> > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > MemTotal:     32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree:      46    59    51    33    3078  3535  2708  3511
> > > >
> > > > The difference is 18G vs 12G of free memory summed across all NUMA nodes right after boot of the system. If you have some hints on how to debug what is actually occupying all that memory, maybe in both cases - we would be happy to debug more!
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman
> > > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > >     ice: alter feature support check for SRIOV and LAG
> > > >
> > > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > >     Previously, the ice driver had support for using a handler for bonding netdev events to ensure that conflicting features were not allowed to be activated at the same time. While this was still in place, additional support was added to specifically support SRIOV and LAG together. These both utilized the netdev event handler, but the SRIOV and LAG feature was behind a capabilities feature check to make sure the current NVM has support.
> > > >
> > > >     The exclusion part of the event handler should be removed since there are users who have custom made solutions that depend on the non-exclusion of features.
> > > >
> > > >     Wrap the creation/registration and cleanup of the event handler and associated structs in the probe flow with a feature check so that the only systems that support the full implementation of LAG features will initialize support. This will leave other systems unhindered with functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two-line commit would do anything to increase memory usage by the ice driver. If anything, I would expect it to lower memory usage, as it has the potential to stop the allocation of memory for the pf->lag struct.
> > >
> > > DaveE
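A note for anyone trying to reproduce the per-node tables quoted above: the thread does not say which tool printed them, but the same counters (MemTotal, MemFree and the hugepage totals) are exposed per node under /sys/devices/system/node/ on Linux. The following is only a minimal userspace sketch of reading those files, not part of the original thread; the file names and output format are illustrative.

#include <stdio.h>

int main(void)
{
	/* Walk node0, node1, ... until a node's meminfo file is missing. */
	for (int node = 0; ; node++) {
		char path[128];
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);

		FILE *f = fopen(path, "r");
		if (!f)
			break;

		long mem_total_kb = 0, mem_free_kb = 0, hp_total = 0, hp_free = 0, val;
		char line[256];

		/* Lines look like "Node 0 MemTotal:  32264000 kB". */
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "Node %*d MemTotal: %ld kB", &val) == 1)
				mem_total_kb = val;
			else if (sscanf(line, "Node %*d MemFree: %ld kB", &val) == 1)
				mem_free_kb = val;
			else if (sscanf(line, "Node %*d HugePages_Total: %ld", &val) == 1)
				hp_total = val;
			else if (sscanf(line, "Node %*d HugePages_Free: %ld", &val) == 1)
				hp_free = val;
		}
		fclose(f);

		printf("node%d: MemTotal %ld MB, MemFree %ld MB, HugePages %ld/%ld free\n",
		       node, mem_total_kb / 1024, mem_free_kb / 1024, hp_free, hp_total);
	}
	return 0;
}

Built with something like "cc -O2 -o node_meminfo node_meminfo.c", it prints one line per node; HugePages_Total/Free are page counts, so multiply them by whatever huge page size the host is configured with to compare against the HPTotalGiB/HPFreeGiB rows.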
> >
> > Hello,
> >
> > I believe we can track these as two different issues. So I reported the ICE driver commit in an email with the subject "[REGRESSION] Intel ICE Ethernet driver in linux >= 6.6.9 triggers extra memory consumption and cause continous kswapd* usage and continuous swapping" to Jesse Brandeburg, Tony Nguyen, intel-wired-lan@lists.osuosl.org and Dave Ertman.
> >
> > Let's track the mglru issue here in this email thread. Yu, the kernel build with your mglru-fix-6.6.9.patch seems to be OK, at least running it for 3 days without kswapd usage (excluding the ice driver commit).
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a difference? IOW, were you able to reproduce the problem consistently without it?
>
> Thanks!

Hi Yu,

the mglru-fix-6.6.9.patch is needed for all kernels >= 6.6.8 up to 6.7. I tested the new 6.7 (without the mglru fix) and that kernel is fine, as I cannot trigger the problem there.
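Since the thread keeps coming back to the watermark check in the hunk quoted near the top, here is a small self-contained C sketch of that decision, added for illustration only: it is not the kernel code, and the types, values and the padding constant are stand-ins (the slack Charan suggests is sized from the kernel's MIN_LRU_BATCH, which is tied to BITS_PER_LONG). The idea is simply that kswapd should keep reclaiming only while some zone up to reclaim_idx is still below its high (or promo) watermark.

#include <stdbool.h>
#include <stdio.h>

#define MIN_LRU_BATCH 64	/* stand-in for the kernel's BITS_PER_LONG-based value */

struct zone_info {
	bool managed;		/* the zone has managed pages */
	long free_pages;
	long high_wmark;	/* pages required by the high/promo watermark */
};

/*
 * Return true while reclaim should continue, i.e. while at least one
 * eligible zone is still below its target.  Passing pad_watermark adds
 * the MIN_LRU_BATCH slack discussed above, so a later "can kswapd sleep
 * now?" style check is less likely to fail right away.
 */
static bool need_more_reclaim(const struct zone_info *zones, int nr_zones,
			      int reclaim_idx, bool pad_watermark)
{
	for (int i = 0; i <= reclaim_idx && i < nr_zones; i++) {
		long target = zones[i].high_wmark;

		if (pad_watermark)
			target += MIN_LRU_BATCH;

		if (zones[i].managed && zones[i].free_pages < target)
			return true;	/* this zone is still short of the mark */
	}
	return false;			/* every eligible zone looks fine */
}

int main(void)
{
	struct zone_info zones[] = {
		{ true, 1020, 1000 },	/* just above the bare high watermark */
		{ true, 5000, 1000 },	/* comfortably above it */
	};

	printf("continue reclaim (plain):  %d\n", need_more_reclaim(zones, 2, 1, false));
	printf("continue reclaim (padded): %d\n", need_more_reclaim(zones, 2, 1, true));
	return 0;
}

Run as is, the plain check says the nodes look fine while the padded one still asks for a little more reclaim, which is roughly the trade-off being debated in the thread: extra slack before kswapd sleeps versus extra kswapd run time.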
--
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData