From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 31 Oct 2024 09:27:54 +1300
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, Yosry Ahmed, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Oct 31, 2024 at 3:51 AM Usama Arif wrote:
>
>
>
> On 28/10/2024 22:03, Barry Song wrote:
> > On Mon, Oct 28, 2024 at 8:07 PM Usama Arif wrote:
> >>
> >>
> >>
> >> On 27/10/2024 01:14, Barry Song wrote:
> >>> From: Barry Song
> >>>
> >>> Always using mTHP in a memcg, even when it is nearly full, might not
> >>> be the best option. Consider a system that uses only small folios:
> >>> after each reclamation, a process has at least SWAP_CLUSTER_MAX pages
> >>> of buffer space before it can initiate the next reclamation. However,
> >>> large folios can quickly fill this space, rapidly bringing the memcg
> >>> back to full capacity, even though some portions of the large folios
> >>> may not be immediately needed or used by the process.
> >>>
> >>> Usama and Kanchana identified a regression when building the kernel in
> >>> a memcg with memory.max set to a small value while enabling large
> >>> folio swap-in support on zswap[1].
> >>>
> >>> The issue arises from an edge case where the memory cgroup remains
> >>> nearly full most of the time. Consequently, bringing in mTHP can
> >>> quickly cause a memcg overflow, triggering a swap-out. The subsequent
> >>> swap-in then recreates the overflow, resulting in a repetitive cycle.
> >>>
> >>> We need a mechanism to stop the cup from overflowing continuously.
> >>> One potential solution is to slow the filling process when we identify
> >>> that the cup is nearly full.
> >>>
> >>> Usama reported an improvement when we mitigate mTHP swap-in as the
> >>> memcg approaches full capacity[2]:
> >>>
> >>> int mem_cgroup_swapin_charge_folio(...)
> >>> {
> >>>         ...
> >>>         if (folio_test_large(folio) &&
> >>>             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> >>>                 ret = -ENOMEM;
> >>>         else
> >>>                 ret = charge_memcg(folio, memcg, gfp);
> >>>         ...
> >>> }
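> >>>
> >>> As a rough illustration of the arithmetic (assuming 4KiB base pages
> >>> and SWAP_CLUSTER_MAX = 32, i.e. 128KiB of headroom after a reclaim
> >>> pass): a single 64KiB mTHP swap-in consumes 16 of those 32 pages at
> >>> once, and two such faults refill the memcg completely, whereas the
> >>> same headroom would absorb 32 separate small-folio faults.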
> >>>
> >>> AMD 16K+32K THP=always
> >>> metric         mm-unstable    mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> >>> real           1m23.038s      1m23.050s                                   1m22.704s
> >>> user           53m57.210s     53m53.437s                                  53m52.577s
> >>> sys            7m24.592s      7m48.843s                                   7m22.519s
> >>> zswpin         612070         999244                                      815934
> >>> zswpout        2226403        2347979                                     2054980
> >>> pgfault        20667366       20481728                                    20478690
> >>> pgmajfault     385887         269117                                      309702
> >>>
> >>> AMD 16K+32K+64K THP=always
> >>> metric         mm-unstable    mm-unstable + large folio zswapin series    mm-unstable + large folio zswapin + no swap thrashing fix
> >>> real           1m22.975s      1m23.266s                                   1m22.549s
> >>> user           53m51.302s     53m51.069s                                  53m46.471s
> >>> sys            7m40.168s      7m57.104s                                   7m25.012s
> >>> zswpin         676492         1258573                                     1225703
> >>> zswpout        2449839        2714767                                     2899178
> >>> pgfault        17540746       17296555                                    17234663
> >>> pgmajfault     429629         307495                                      287859
> >>>
> >>> I wonder if we can extend the mitigation to do_anonymous_page() as
> >>> well. Lacking hardware with TLB coalescing (AMD) or CONT-PTE (ARM),
> >>> I conducted a quick test on my Intel i9 workstation with 10 cores and
> >>> 2 threads. I enabled a 12 GiB zRAM device while running kernel builds
> >>> in a memcg with memory.max set to 1 GiB.
> >>>
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> >>> $ echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> >>> $ echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >>>
> >>> $ time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >>>   CROSS_COMPILE=aarch64-linux-gnu- Image -j10 1>/dev/null 2>/dev/null
> >>>
> >>>                disable-mTHP-swapin   mm-unstable   with-this-patch
> >>> Real:          6m54.595s             7m4.832s      6m45.811s
> >>> User:          66m42.795s            66m59.984s    67m21.150s
> >>> Sys:           12m7.092s             15m18.153s    12m52.644s
> >>> pswpin:        4262327               11723248      5918690
> >>> pswpout:       14883774              19574347      14026942
> >>> 64k-swpout:    624447                889384        480039
> >>> 32k-swpout:    115473                242288        73874
> >>> 16k-swpout:    158203                294672        109142
> >>> 64k-swpin:     0                     495869        159061
> >>> 32k-swpin:     0                     219977        56158
> >>> 16k-swpin:     0                     223501        81445
> >>>
> >>
> >
> > Hi Usama,
> >
> >> hmm, both the user and sys time are worse with the patch compared to
> >> disable-mTHP-swapin. I wonder if the real time is an anomaly; if you
> >> repeat the experiment, the real time might turn out worse as well?
> >
> > Well, I've improved my script to include a loop:
> >
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > for ((i=1; i<=100; i++))
> > do
> >     echo "Executing round $i"
> >     make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
> >     echo 3 > /proc/sys/vm/drop_caches
> >     time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
> >         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j15 1>/dev/null 2>/dev/null
> >     cat /proc/vmstat | grep pswp
> >     echo -n 64k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout
> >     echo -n 32k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout
> >     echo -n 16k-swpout: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout
> >     echo -n 64k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin
> >     echo -n 32k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin
> >     echo -n 16k-swpin: ; cat /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpin
> > done
> >
> > I've noticed that the user/sys/real time on my i9 machine fluctuates
> > constantly between runs, for example:
> > real    6m52.087s
> > user    67m12.463s
> > sys     13m8.281s
> > ...
> > real    7m42.937s
> > user    66m55.250s
> > sys     12m56.330s
> > ...
> > real    6m49.374s
> > user    66m37.040s
> > sys     12m44.542s
> > ...
> > real    6m54.205s
> > user    65m49.732s
> > sys     11m33.078s
> > ...
> > likely due to unstable temperatures and I/O latency. As a result, my
> > data doesn't seem reference-worthy.
> >
>
> So I had suggested retrying the experiment to see how reproducible it is,
> but had not done that myself! Thanks for sharing this. I tried many times
> on the AMD server and I see varying numbers as well.
>
> AMD 16K THP always, cgroup = 4G, large folio zswapin patches
> real    1m28.351s
> user    54m14.476s
> sys     8m46.596s
> zswpin  811693
> zswpout 2137310
> pgfault 27344671
> pgmajfault 290510
> ..
> real    1m24.557s
> user    53m56.815s
> sys     8m10.200s
> zswpin  571532
> zswpout 1645063
> pgfault 26989075
> pgmajfault 205177
> ..
> real    1m26.083s
> user    54m5.303s
> sys     9m55.247s
> zswpin  1176292
> zswpout 2910825
> pgfault 27286835
> pgmajfault 419746
>
> The sys time especially can vary by a large amount. I think you see the
> same.
>
> > As phone engineers, we never use phones to run kernel builds. I'm also
> > quite certain that phones won't provide stable and reliable data for this
> > type of workload. Without access to a Linux server to conduct the test,
> > I really need your help.
> >
> > I used to work on optimizing the ARM server scheduler and memory
> > management, and I really miss that machine I had until three years ago :-)
> >
> >>
> >>> I need Usama's assistance to identify a suitable patch, as I lack
> >>> access to hardware such as AMD machines and ARM servers with TLB
> >>> optimization.
> >>>
> >>> [1] https://lore.kernel.org/all/b1c17b5e-acd9-4bef-820e-699768f1426d@gmail.com/
> >>> [2] https://lore.kernel.org/all/7a14c332-3001-4b9a-ada3-f4d6799be555@gmail.com/
> >>>
> >>> Cc: Kanchana P Sridhar
> >>> Cc: Usama Arif
> >>> Cc: David Hildenbrand
> >>> Cc: Baolin Wang
> >>> Cc: Chris Li
> >>> Cc: Yosry Ahmed
> >>> Cc: "Huang, Ying"
> >>> Cc: Kairui Song
> >>> Cc: Ryan Roberts
> >>> Cc: Johannes Weiner
> >>> Cc: Michal Hocko
> >>> Cc: Roman Gushchin
> >>> Cc: Shakeel Butt
> >>> Cc: Muchun Song
> >>> Signed-off-by: Barry Song
> >>> ---
> >>>  include/linux/memcontrol.h |  9 ++++++++
> >>>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++++++++++
> >>>  mm/memory.c                | 17 ++++++++++++++
> >>>  3 files changed, 71 insertions(+)
> >>>
> >>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >>> index 524006313b0d..8bcc8f4af39f 100644
> >>> --- a/include/linux/memcontrol.h
> >>> +++ b/include/linux/memcontrol.h
> >>> @@ -697,6 +697,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >>>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>                 long nr_pages);
> >>>
> >>> +int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>> +               swp_entry_t *entry);
> >>> +
> >>>  int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >>>                 gfp_t gfp, swp_entry_t entry);
> >>>
> >>> @@ -1201,6 +1204,12 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >>>         return 0;
> >>>  }
> >>>
> >>> +static inline int mem_cgroup_precharge_large_folio(struct mm_struct *mm,
> >>> +               swp_entry_t *entry)
> >>> +{
> >>> +       return 0;
> >>> +}
> >>> +
> >>>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >>>                 struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> >>>  {
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 17af08367c68..f3d92b93ea6d 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -4530,6 +4530,51 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >>>         return 0;
> >>>  }
> >>>
> >>> +static inline bool mem_cgroup_has_margin(struct mem_cgroup *memcg)
> >>> +{
> >>> +       for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> >>> +               if (mem_cgroup_margin(memcg) < HPAGE_PMD_NR)
> >>
> >> There might be 3 issues with the approach:
> >>
> >> It's a very big margin. Let's say you have ARM64_64K_PAGES and you have
> >> 256K THP set to always. As HPAGE_PMD is 512M for 64K pages, you are
> >> basically saying you need 512M of free memory to swap in just 256K?
> >
> > Right, sorry for the noisy code. I was just thinking about 4KB pages
> > and wondering if we could simplify the code.
> >
> >> It's an uneven margin for different folio sizes.
> >> For a 16K folio swapin, you are checking if there is margin for 128 folios,
> >> but for a 1M folio swapin, you are checking if there is margin for just 2 folios.
> >>
> >> Maybe it might be better to make this dependent on some factor of folio_nr_pages?
> >
> > Agreed. This is similar to what we discussed regarding your zswap mTHP
> > swap-in series:
> >
> > int mem_cgroup_swapin_charge_folio(...)
> > {
> >         ...
> >         if (folio_test_large(folio) &&
> >             mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH,
> >                                            folio_nr_pages(folio)))
> >                 ret = -ENOMEM;
> >         else
> >                 ret = charge_memcg(folio, memcg, gfp);
> >         ...
> > }
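> >
> > A folio-size-proportional check might look roughly like this (an
> > untested sketch; the factor of 2 is an arbitrary placeholder for
> > whatever headroom multiple we would settle on):
> >
> > int mem_cgroup_swapin_charge_folio(...)
> > {
> >         ...
> >         /* require free margin proportional to the folio being charged */
> >         if (folio_test_large(folio) &&
> >             mem_cgroup_margin(memcg) < 2 * folio_nr_pages(folio))
> >                 ret = -ENOMEM;
> >         else
> >                 ret = charge_memcg(folio, memcg, gfp);
> >         ...
> > }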
> >
> > As someone focused on phones, my challenge is the absence of stable
> > platforms to benchmark this type of workload. If possible, Usama, I
> > would greatly appreciate it if you could take the lead on the patch.
> >
> >>
> >> As Johannes pointed out, the charging code already does the margin check.
> >> So for 4K, the check just checks if there is 4K available, but for 16K
> >> it checks if a lot more than 16K is available. Maybe there should be a
> >> similar policy for all? I guess this is similar to my 2nd point, but it
> >> considers 4K folios as well.
> >
> > I don't think the charging code performs a margin check. It simply
> > tries to charge the specified nr_pages (whether 1 or more). If nr_pages
> > are available, the charge proceeds; otherwise, if the GFP flags allow
> > blocking, it triggers memory reclamation to reclaim
> > max(SWAP_CLUSTER_MAX, nr_pages) base pages.
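> >
> > In simplified form, my reading of try_charge_memcg() is roughly the
> > following (only a sketch; batching, retries, memsw accounting and the
> > over-high punting are all omitted):
> >
> >         struct page_counter *counter;
> >
> >         /* no margin check: just try to consume nr_pages directly */
> >         if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
> >                 return 0;       /* fits, the charge succeeds */
> >
> >         if (!gfpflags_allow_blocking(gfp_mask))
> >                 return -ENOMEM;
> >
> >         /* blockable: reclaim and retry; the reclaim target works out
> >          * to max(nr_pages, SWAP_CLUSTER_MAX) base pages */
> >         try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
> >                                      MEMCG_RECLAIM_MAY_SWAP);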
>
> So if you have defrag not set to always, it will not trigger reclamation.
> I think that is a bigger usecase, i.e. defrag=madvise,defer,etc. is
> probably used much more than always.
>
> In the current code, in that case try_charge_memcg will return -ENOMEM
> all the way to mem_cgroup_swapin_charge_folio, and alloc_swap_folio will
> then try the next order. So even though it might not be calling the
> mem_cgroup_margin function, it is kind of doing the same thing?
>
> > If, after reclamation, we have exactly SWAP_CLUSTER_MAX pages available, a
> > large folio with nr_pages == SWAP_CLUSTER_MAX will successfully charge,
> > immediately filling the memcg.
> >
> > Shortly after, smaller folios, typically with blockable GFP, will quickly
> > trigger additional reclamation. While nr_pages - 1 subpages of the large
> > folio may not be immediately needed, they still occupy enough space to
> > fill the memcg to capacity.
> >
> > My second point about the mitigation is as follows: for a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP
> > operates at a larger granularity, some internal fragmentation is
> > unavoidable, regardless of optimization. Could the mitigation code help
> > in automatically tuning this fragmentation?
> >
>
> I agree with the point that enabling mTHP always is not the right thing
> to do on all platforms. I also think it might be the case that enabling
> mTHP is a good thing for some workloads, but enabling mTHP swapin along
> with it might not be.
>
> As you said, when you have apps switching between foreground and background
> in android, it probably makes sense to have large folio swapping, as you
> want to bring in all the pages from the background app as quickly as
> possible, and you also get the TLB optimizations and smaller LRU overhead
> once all the pages have been brought in.
> The Linux kernel build test doesn't really benefit from the TLB
> optimization and smaller LRU overhead, as the pages are probably very
> short-lived. So I think it doesn't show the benefit of large folio swapin
> properly, and large folio swapin should probably be disabled for this kind
> of workload, even though mTHP should be enabled.

I'm not entirely sure whether this applies to platforms without TLB
optimization, especially in the absence of swap. In a memory-limited cgroup
without swap, would mTHP still cause significant thrashing of file-backed
folios?

When a large swap file is present, the inability to swap in mTHP seems to
act as a workaround for fragmentation, allowing fragmented pages of the
original mTHP from do_anonymous_page() to remain in swap.

> I am not sure that the approach we are trying in this patch is the right
> way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
>   then the mitigation being tried here won't apply.
> - Instead of this being a large folio swapin issue, is it more of a
>   readahead issue? If we zswap (without the large folio swapin series) and
>   change the window to 1 in swap_vma_readahead, we might see an improvement
>   in linux kernel build time when cgroup memory is limited, as readahead
>   would probably cause swap thrashing as well.
> - Instead of looking at the cgroup margin, maybe we should try and look at
>   the rate of change of workingset_restore_anon? This might be a lot more
>   complicated to do, but it is probably the right metric to determine swap
>   thrashing. It also means that this could be used in both the synchronous
>   swapcache skipping path and the swapin_readahead path.
>   (Thanks Johannes for suggesting this)
>
> With the large folio swapin, I do see a large improvement when considering
> only swapin performance and latency, in the same way as you saw in zram.
> Maybe the right short-term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regression.

A crucial component is still missing: managing the compression and
decompression of multiple pages as a larger block. This could significantly
reduce system time and potentially resolve the kernel build issue within a
small memory cgroup, even with swap thrashing. I'll send an update ASAP so
you can rebase for zswap.

> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin, if
> that makes sense?
> Longer term, I can try to have a look at whether we can do something with
> workingset_restore_anon to improve things.
>
> Thanks,
> Usama

Thanks
Barry