From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 25 Oct 2024 06:48:31 +1300
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
To: Johannes Weiner
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, chengming.zhou@linux.dev,
 david@redhat.com, hanchuanhua@oppo.com, kanchana.p.sridhar@intel.com,
 kernel-team@meta.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, minchan@kernel.org, nphamcs@gmail.com, riel@surriel.com,
 ryan.roberts@arm.com, senozhatsky@chromium.org, shakeel.butt@linux.dev,
 v-songbaohua@oppo.com, willy@infradead.org, ying.huang@intel.com,
 yosryahmed@google.com
In-Reply-To: <20241024142942.GA279597@cmpxchg.org>
References: <20241023233548.23348-1-21cnbao@gmail.com> <20241024142942.GA279597@cmpxchg.org>
Content-Type: text/plain; charset="UTF-8"
On Fri, Oct 25, 2024 at 3:29 AM Johannes Weiner wrote:
>
> On Thu, Oct 24, 2024 at 12:35:48PM +1300, Barry Song wrote:
> > On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif wrote:
> > > >
> > > >
> > > > On 23/10/2024 19:52, Barry Song wrote:
> > > > > On Thu, Oct 24, 2024 at
> > > > > 7:31 AM Usama Arif wrote:
> > > > >>
> > > > >>
> > > > >> On 23/10/2024 19:02, Yosry Ahmed wrote:
> > > > >>> [..]
> > > > >>>>>> I suspect the regression occurs because you're running an edge case
> > > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't
> > > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in
> > > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> > > > >>>>>> next swap-in then recreates the overflow, leading to a repeating
> > > > >>>>>> cycle.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going
> > > > >>>>> on as well.
> > > > >>>>>
> > > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and
> > > > >>>>>> overflowing. While not a definitive fix, the following change might help
> > > > >>>>>> improve the situation:
> > > > >>>>>>
> > > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644
> > > > >>>>>> --- a/mm/memcontrol.c
> > > > >>>>>> +++ b/mm/memcontrol.c
> > > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > > >>>>>>                 memcg = get_mem_cgroup_from_mm(mm);
> > > > >>>>>>         rcu_read_unlock();
> > > > >>>>>>
> > > > >>>>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>>> +       if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
> > > > >>>>>> +               ret = -ENOMEM;
> > > > >>>>>> +       else
> > > > >>>>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>>>
> > > > >>>>>>         css_put(&memcg->css);
> > > > >>>>>>         return ret;
> > > > >>>>>> }
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> The diff makes sense to me. Let me test later today and get back to you.
> > > > >>>>>
> > > > >>>>> Thanks!
> > > > >>>>>
> > > > >>>>>> Please confirm if it makes the kernel build with memcg limitation
> > > > >>>>>> faster. If so, let's
> > > > >>>>>> work together to figure out an official patch :-) The above code hasn't considered
> > > > >>>>>> the parent memcg's overflow, so not an ideal fix.
> > > > >>>>>>
> > > > >>>>
> > > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement!
> > > > >>>> I think the below might be better to do:
> > > > >>>>
> > > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644
> > > > >>>> --- a/mm/memcontrol.c
> > > > >>>> +++ b/mm/memcontrol.c
> > > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > > >>>>         memcg = get_mem_cgroup_from_mm(mm);
> > > > >>>>         rcu_read_unlock();
> > > > >>>>
> > > > >>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > > >>>> +       if (folio_test_large(folio) &&
> > > > >>>> +           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> > > > >>>> +               ret = -ENOMEM;
> > > > >>>> +       else
> > > > >>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>
> > > > >>>>         css_put(&memcg->css);
> > > > >>>>         return ret;
> > > > >>>>
> > > > >>>>
> > > > >>>> AMD 16K+32K THP=always
> > > > >>>> metric        mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > > > >>>> real          1m23.038s     1m23.050s                                  1m22.704s
> > > > >>>> user          53m57.210s    53m53.437s                                 53m52.577s
> > > > >>>> sys           7m24.592s     7m48.843s                                  7m22.519s
> > > > >>>> zswpin        612070        999244                                     815934
> > > > >>>> zswpout       2226403       2347979                                    2054980
> > > > >>>> pgfault       20667366      20481728                                   20478690
> > > > >>>> pgmajfault    385887        269117                                     309702
> > > > >>>>
> > > > >>>> AMD 16K+32K+64K THP=always
> > > > >>>> metric        mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > > > >>>> real
> > > > >>>>               1m22.975s     1m23.266s                                  1m22.549s
> > > > >>>> user          53m51.302s    53m51.069s                                 53m46.471s
> > > > >>>> sys           7m40.168s     7m57.104s                                  7m25.012s
> > > > >>>> zswpin        676492        1258573                                    1225703
> > > > >>>> zswpout       2449839       2714767                                    2899178
> > > > >>>> pgfault       17540746      17296555                                   17234663
> > > > >>>> pgmajfault    429629        307495                                     287859
> > > > >>>>
> > > > >>>
> > > > >>> Thanks Usama and Barry for looking into this. It seems like this would
> > > > >>> fix a regression with large folio swapin regardless of zswap. Can the
> > > > >>> same result be reproduced on zram without this series?
> > > > >>
> > > > >>
> > > > >> Yes, it's a regression in large folio swapin support regardless of zswap/zram.
> > > > >>
> > > > >> Need to do 3 tests, one with probably the below diff to remove large folio support,
> > > > >> one with current upstream and one with upstream + swap thrashing fix.
> > > > >>
> > > > >> We only use zswap and don't have a zram setup (and I am a bit lazy to create one :)).
> > > > >> Any zram volunteers to try this?
> > > > >
> > > > > Hi Usama,
> > > > >
> > > > > I tried a quick experiment:
> > > > >
> > > > > echo 1 > /sys/module/zswap/parameters/enabled
> > > > > echo 0 > /sys/module/zswap/parameters/enabled
> > > > >
> > > > > This was to test the zRAM scenario. Enabling zswap even
> > > > > once disables mTHP swap-in. :)
> > > > >
> > > > > I noticed a similar regression with zRAM alone, but the change resolved
> > > > > the issue and even sped up the kernel build compared to the setup without
> > > > > mTHP swap-in.
> > > >
> > > > Thanks for trying, this is amazing!
> > > >
> > > > > However, I'm still working on a proper patch to address this. The current
> > > > > approach:
> > > > >
> > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))
> > > > >
> > > > > isn't sufficient, as it doesn't cover cases where group A contains group B, and
> > > > > we're operating within group B.
> > > > > The problem occurs not at the boundary of
> > > > > group B but at the boundary of group A.
> > > >
> > > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are
> > > > trying to swapin a 16kB page, we basically check if at least 64/4 = 16 folios can be
> > > > charged to the cgroup, which is reasonable. If we try to swapin a 1M folio, we just
> > > > check if we can charge at least 1 folio. Are you saying that checking just 1 folio
> > > > is not enough in this case and can still cause thrashing, i.e. we should check more?
> > >
> > > My understanding is that cgroups are hierarchical. Even if we don't
> > > hit the memory limit of the folio's direct memcg, we could still
> > > reach the limit of one of its parent memcgs. Imagine a structure like:
> > >
> > > /sys/fs/cgroup/a/b/c/d
> > >
> > > If we're compiling the kernel in d, there's a chance that while d
> > > isn't at its limit, its parents (c, b, or a) could be. Currently,
> > > the check only applies to d.
> >
> > To clarify, I mean something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 17af08367c68..cc6d21848ee8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >         return 0;
> >  }
> >
> > +/*
> > + * When the memory cgroup is nearly full, swapping in large folios can
> > + * easily lead to swap thrashing, as the memcg operates on the edge of
> > + * being full. We maintain a margin to allow for quick fallback to
> > + * smaller folios during the swap-in process.
> > + */
> > +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg,
> > +                                                      struct folio *folio)
> > +{
> > +       unsigned int nr;
> > +
> > +       if (!folio_test_large(folio))
> > +               return false;
> > +
> > +       nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH);
> > +       for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > +               if (mem_cgroup_margin(memcg) < nr)
> > +                       return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> >  /**
> >   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
> >   * @folio: folio to charge.
> > @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >  {
> >         struct mem_cgroup *memcg;
> >         unsigned short id;
> > -       int ret;
> > +       int ret = -ENOMEM;
> > +       bool margin_prot;
> >
> >         if (mem_cgroup_disabled())
> >                 return 0;
> > @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >         memcg = mem_cgroup_from_id(id);
> >         if (!memcg || !css_tryget_online(&memcg->css))
> >                 memcg = get_mem_cgroup_from_mm(mm);
> > +       margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio);
> >         rcu_read_unlock();
> >
> > -       ret = charge_memcg(folio, memcg, gfp);
> > +       if (!margin_prot)
> > +               ret = charge_memcg(folio, memcg, gfp);
> >
> >         css_put(&memcg->css);
> >         return ret;
>
> I'm not quite following.
>
> The charging code DOES the margin check. If you just want to avoid
> reclaim, pass gfp without __GFP_DIRECT_RECLAIM, and it will return
> -ENOMEM if there is no margin.
>
> alloc_swap_folio() passes the THP mask, which should not include the
> reclaim flag per default (GFP_TRANSHUGE_LIGHT). Unless you run with
> defrag=always. Is that what's going on?

No, quite sure "defrag=never" can just achieve the same result.
Imagine we only have small folios: each time reclamation occurs, we have
at least a SWAP_CLUSTER_MAX buffer before the next reclamation is
triggered.

        .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

However, with large folios, we can quickly exhaust the SWAP_CLUSTER_MAX
buffer and reach the next reclamation point. Once we consume
SWAP_CLUSTER_MAX - 1, the mem_cgroup_swapin_charge_folio() call for the
final small folio with GFP_KERNEL will trigger reclamation.

        if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
                                           GFP_KERNEL, entry)) {

Thanks
Barry
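The buffer-exhaustion effect described above can be sketched with a toy model (a hypothetical Python illustration, not kernel code; the helper name is made up, and it assumes SWAP_CLUSTER_MAX is 32 pages as in current kernels and that each reclaim pass frees max(nr_pages, SWAP_CLUSTER_MAX) pages of margin):

```python
SWAP_CLUSTER_MAX = 32  # pages freed per reclaim pass (kernel default)

def faults_until_next_reclaim(folio_pages: int) -> int:
    """Toy model: one reclaim pass restores max(folio_pages, SWAP_CLUSTER_MAX)
    pages of memcg margin; count how many swap-in faults of this folio size
    fit into that margin before the memcg is back at its limit and the next
    charge must reclaim again."""
    margin = max(folio_pages, SWAP_CLUSTER_MAX)
    faults = 0
    while margin >= folio_pages:
        margin -= folio_pages
        faults += 1
    return faults

# 4 KiB (1-page) folios get 32 faults between reclaim passes;
# 64 KiB (16-page) folios get only 2 before reclaim fires again.
for pages in (1, 4, 16):
    print(pages, faults_until_next_reclaim(pages))
```

Under these assumptions, large folios drain the post-reclaim margin an order of magnitude faster, which is the thrashing cycle the thread is discussing.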