From: Jaroslav Pulchart
Date: Tue, 16 Jan 2024 18:34:49 +0100
Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
To: Yu Zhao
Cc: "Ertman, David M", Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm@linux-foundation.org, linux-mm@kvack.org

> > On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart wrote:
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart
> > > > Cc: Yu Zhao; Daniel Secik; Charan Teja Kalla; Kalesh Singh; akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart wrote:
> > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart wrote:
> > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to you? Any additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment on the code snippet: checking just the 'high wmark' pages might succeed here but can fail in the immediate kswapd sleep, see prepare_kswapd_sleep(). This can show up as an increased KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like the above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time; on the contrary, it is fixing the kswapd continuous run issue.
> > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like this for zone_watermark_ok():
> > > > > > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > It is hard to be convinced of this 'MIN_LRU_BATCH << 2' empirical value; maybe we should at least use 'MIN_LRU_BATCH' with the mentioned reasoning, is all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > > > +                       return false;
> > > > > > > > > > +       }
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we tried to update the servers to 6.6.9, which contains the mglru fixes (from 6.6.8), and the server behaves much, much worse.
> > > > > > > >
> > > > > > > > I immediately got multiple kswapd* threads at ~100% load:
> > > > > > > >   555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
> > > > > > > >   554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
> > > > > > > >   556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
> > > > > > > >
> > > > > > > > Are the changes in upstream different compared to the initial patch which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and what went into 6.6.9. The fix in 6.6.9 is disabled under a special condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for the prompt response. I did a test with the patch and it didn't help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization of all NUMA nodes of the first CPU socket when using 6.6.9, which is the worst situation, but the kswapd load is visible from 6.6.8 on.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 4 chiplets per socket, there are 2 sockets
> > > > > > * 32 GB of RAM for each chiplet, 28 GB of which are in hugepages
> > > > > >   Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid memory pressure, however it is even worse now, on the contrary.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when the application started == OK
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree:      2766  2715  63    2366  3495  2990  3462  252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when the application started
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree:      2744  2788  65    581   3304  3215  3266  2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when the application started
> > > > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > > > MemTotal:     32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree:      75    60    60    60    3169  2784  3203  2944
> > > > >
> > > > > I ran a few more combinations, and here are the results / findings:
> > > > >
> > > > > 6.6.7-1 (vanilla)                            == OK, no issue
> > > > >
> > > > > 6.6.8-1 (vanilla)                            == single kswapd 100% !
> > > > > 6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > > 6.6.8-1 (revert four mglru patches)          == OK, no issue
> > > > >
> > > > > 6.6.9-1 (vanilla)                            == four kswapd 100% !!!!
> > > > > 6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > > 6.6.9-3 (revert four mglru patches)          == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting the mglru patches helps in the case of kernel 6.6.8,
> > > > > * there is a (new?) problem in the case of the 6.6.9 kernel, which looks not to be related to the mglru patches at all
> > > >
> > > > I was able to bisect this change and it looks like there is something going wrong with the ice driver…
> > > >
> > > > Usually after booting our server we see something like this. Most of the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes that have a really low amount of free memory, and we don't know why, but it looks like that in the end causes the constant swap in/out issue. With the final bit of the patch you've sent earlier in this thread it is almost invisible.
> > > >
> > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > MemTotal:     32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree:      2191  2828  92    292   3344  2916  3594  3222
> > > >
> > > > However, after the following patch we see that more NUMA nodes have such a low amount of memory, and that is causing constant reclaiming of memory because it looks like something inside the kernel ate all the memory. This is right after the start of the system as well.
> > > >
> > > > NUMA nodes:   0     1     2     3     4     5     6     7
> > > > HPTotalGiB:   28    28    28    28    28    28    28    28
> > > > HPFreeGiB:    28    28    28    28    28    28    28    28
> > > > MemTotal:     32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree:      46    59    51    33    3078  3535  2708  3511
> > > >
> > > > The difference is 18G vs 12G of free memory summed across all NUMA nodes right after boot of the system. If you have some hints on how to debug what is actually occupying all that memory, maybe in both cases - we would be happy to debug more!
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman
> > > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > >     ice: alter feature support check for SRIOV and LAG
> > > >
> > > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > >     Previously, the ice driver had support for using a handler for bonding netdev events to ensure that conflicting features were not allowed to be activated at the same time. While this was still in place, additional support was added to specifically support SRIOV and LAG together. These both utilized the netdev event handler, but the SRIOV and LAG feature was behind a capabilities feature check to make sure the current NVM has support.
> > > >
> > > >     The exclusion part of the event handler should be removed since there are users who have custom made solutions that depend on the non-exclusion of features.
> > > >
> > > >     Wrap the creation/registration and cleanup of the event handler and associated structs in the probe flow with a feature check so that the only systems that support the full implementation of LAG features will initialize support. This will leave other systems unhindered with functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two-line commit would do anything to increase memory usage by the ice driver. If anything, I would expect it to lower memory usage, as it has the potential to stop the allocation of memory for the pf->lag struct.
> > >
> > > DaveE
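A note for anyone trying to reproduce the per-node tables quoted above: the thread does not say which tool printed them, but the same counters (MemTotal, MemFree and the hugepage totals) are exposed per node under /sys/devices/system/node/ on Linux. The following is only a minimal userspace sketch of reading those files, not part of the original thread; the file names and output format are illustrative.

#include <stdio.h>

int main(void)
{
	/* Walk node0, node1, ... until a node's meminfo file is missing. */
	for (int node = 0; ; node++) {
		char path[128];
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);

		FILE *f = fopen(path, "r");
		if (!f)
			break;

		long mem_total_kb = 0, mem_free_kb = 0, hp_total = 0, hp_free = 0, val;
		char line[256];

		/* Lines look like "Node 0 MemTotal:  32264000 kB". */
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "Node %*d MemTotal: %ld kB", &val) == 1)
				mem_total_kb = val;
			else if (sscanf(line, "Node %*d MemFree: %ld kB", &val) == 1)
				mem_free_kb = val;
			else if (sscanf(line, "Node %*d HugePages_Total: %ld", &val) == 1)
				hp_total = val;
			else if (sscanf(line, "Node %*d HugePages_Free: %ld", &val) == 1)
				hp_free = val;
		}
		fclose(f);

		printf("node%d: MemTotal %ld MB, MemFree %ld MB, HugePages %ld/%ld free\n",
		       node, mem_total_kb / 1024, mem_free_kb / 1024, hp_free, hp_total);
	}
	return 0;
}

Built with something like "cc -O2 -o node_meminfo node_meminfo.c", it prints one line per node; HugePages_Total/Free are page counts, so multiply them by whatever huge page size the host is configured with to compare against the HPTotalGiB/HPFreeGiB rows.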
> >
> > Hello,
> >
> > I believe we can track these as two different issues. So I reported the ICE driver commit in an email with the subject "[REGRESSION] Intel ICE Ethernet driver in linux >= 6.6.9 triggers extra memory consumption and cause continous kswapd* usage and continuous swapping" to Jesse Brandeburg, Tony Nguyen, intel-wired-lan@lists.osuosl.org and Dave Ertman.
> >
> > Let's track the mglru issue here in this email thread. Yu, the kernel build with your mglru-fix-6.6.9.patch seems to be OK, at least running it for 3 days without kswapd usage (excluding the ice driver commit).
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a difference? IOW, were you able to reproduce the problem consistently without it?
>
> Thanks!

Hi Yu,

the mglru-fix-6.6.9.patch is needed for all kernels >= 6.6.8 up to 6.7. I tested the new 6.7 (without the mglru fix) and that kernel is fine, as I cannot trigger the problem there.
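Since the thread keeps coming back to the watermark check in the hunk quoted near the top, here is a small self-contained C sketch of that decision, added for illustration only: it is not the kernel code, and the types, values and the padding constant are stand-ins (the slack Charan suggests is sized from the kernel's MIN_LRU_BATCH, which is tied to BITS_PER_LONG). The idea is simply that kswapd should keep reclaiming only while some zone up to reclaim_idx is still below its high (or promo) watermark.

#include <stdbool.h>
#include <stdio.h>

#define MIN_LRU_BATCH 64	/* stand-in for the kernel's BITS_PER_LONG-based value */

struct zone_info {
	bool managed;		/* the zone has managed pages */
	long free_pages;
	long high_wmark;	/* pages required by the high/promo watermark */
};

/*
 * Return true while reclaim should continue, i.e. while at least one
 * eligible zone is still below its target.  Passing pad_watermark adds
 * the MIN_LRU_BATCH slack discussed above, so a later "can kswapd sleep
 * now?" style check is less likely to fail right away.
 */
static bool need_more_reclaim(const struct zone_info *zones, int nr_zones,
			      int reclaim_idx, bool pad_watermark)
{
	for (int i = 0; i <= reclaim_idx && i < nr_zones; i++) {
		long target = zones[i].high_wmark;

		if (pad_watermark)
			target += MIN_LRU_BATCH;

		if (zones[i].managed && zones[i].free_pages < target)
			return true;	/* this zone is still short of the mark */
	}
	return false;			/* every eligible zone looks fine */
}

int main(void)
{
	struct zone_info zones[] = {
		{ true, 1020, 1000 },	/* just above the bare high watermark */
		{ true, 5000, 1000 },	/* comfortably above it */
	};

	printf("continue reclaim (plain):  %d\n", need_more_reclaim(zones, 2, 1, false));
	printf("continue reclaim (padded): %d\n", need_more_reclaim(zones, 2, 1, true));
	return 0;
}

Run as is, the plain check says the nodes look fine while the padded one still asks for a little more reclaim, which is roughly the trade-off being debated in the thread: extra slack before kswapd sleeps versus extra kswapd run time.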
--
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData