From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 25 Oct 2024 06:48:31 +1300
Subject: Re: [RFC 0/4] mm: zswap: add support for zswapin of large folios
To: Johannes Weiner
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, chengming.zhou@linux.dev,
 david@redhat.com, hanchuanhua@oppo.com, kanchana.p.sridhar@intel.com,
 kernel-team@meta.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, minchan@kernel.org, nphamcs@gmail.com, riel@surriel.com,
 ryan.roberts@arm.com, senozhatsky@chromium.org, shakeel.butt@linux.dev,
 v-songbaohua@oppo.com, willy@infradead.org, ying.huang@intel.com,
 yosryahmed@google.com
In-Reply-To: <20241024142942.GA279597@cmpxchg.org>
References: <20241023233548.23348-1-21cnbao@gmail.com> <20241024142942.GA279597@cmpxchg.org>
Content-Type: text/plain; charset="UTF-8"
On Fri, Oct 25, 2024 at 3:29 AM Johannes Weiner wrote:
>
> On Thu, Oct 24, 2024 at 12:35:48PM +1300, Barry Song wrote:
> > On Thu, Oct 24, 2024 at 9:36 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Oct 24, 2024 at 8:47 AM Usama Arif wrote:
> > > >
> > > >
> > > > On 23/10/2024 19:52, Barry Song wrote:
> > > > > On Thu, Oct 24, 2024 at
> > > > > 7:31 AM Usama Arif wrote:
> > > > >>
> > > > >>
> > > > >> On 23/10/2024 19:02, Yosry Ahmed wrote:
> > > > >>> [..]
> > > > >>>>>> I suspect the regression occurs because you're running an edge case
> > > > >>>>>> where the memory cgroup stays nearly full most of the time (this isn't
> > > > >>>>>> an inherent issue with large folio swap-in). As a result, swapping in
> > > > >>>>>> mTHP quickly triggers a memcg overflow, causing a swap-out. The
> > > > >>>>>> next swap-in then recreates the overflow, leading to a repeating
> > > > >>>>>> cycle.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Yes, agreed! Looking at the swap counters, I think this is what is going
> > > > >>>>> on as well.
> > > > >>>>>
> > > > >>>>>> We need a way to stop the cup from repeatedly filling to the brim and
> > > > >>>>>> overflowing. While not a definitive fix, the following change might help
> > > > >>>>>> improve the situation:
> > > > >>>>>>
> > > > >>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > >>>>>> index 17af08367c68..f2fa0eeb2d9a 100644
> > > > >>>>>> --- a/mm/memcontrol.c
> > > > >>>>>> +++ b/mm/memcontrol.c
> > > > >>>>>> @@ -4559,7 +4559,10 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > > >>>>>>                 memcg = get_mem_cgroup_from_mm(mm);
> > > > >>>>>>         rcu_read_unlock();
> > > > >>>>>>
> > > > >>>>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>>> +       if (folio_test_large(folio) && mem_cgroup_margin(memcg) < MEMCG_CHARGE_BATCH)
> > > > >>>>>> +               ret = -ENOMEM;
> > > > >>>>>> +       else
> > > > >>>>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>>>
> > > > >>>>>>         css_put(&memcg->css);
> > > > >>>>>>         return ret;
> > > > >>>>>> }
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> The diff makes sense to me. Let me test later today and get back to you.
> > > > >>>>>
> > > > >>>>> Thanks!
> > > > >>>>>
> > > > >>>>>> Please confirm if it makes the kernel build with memcg limitation
> > > > >>>>>> faster. If so, let's
> > > > >>>>>> work together to figure out an official patch :-) The above code hasn't considered
> > > > >>>>>> the parent memcg's overflow, so not an ideal fix.
> > > > >>>>>>
> > > > >>>>
> > > > >>>> Thanks Barry, I think this fixes the regression, and even gives an improvement!
> > > > >>>> I think the below might be better to do:
> > > > >>>>
> > > > >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > >>>> index c098fd7f5c5e..0a1ec55cc079 100644
> > > > >>>> --- a/mm/memcontrol.c
> > > > >>>> +++ b/mm/memcontrol.c
> > > > >>>> @@ -4550,7 +4550,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > > >>>>         memcg = get_mem_cgroup_from_mm(mm);
> > > > >>>>         rcu_read_unlock();
> > > > >>>>
> > > > >>>> -       ret = charge_memcg(folio, memcg, gfp);
> > > > >>>> +       if (folio_test_large(folio) &&
> > > > >>>> +           mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio)))
> > > > >>>> +               ret = -ENOMEM;
> > > > >>>> +       else
> > > > >>>> +               ret = charge_memcg(folio, memcg, gfp);
> > > > >>>>
> > > > >>>>         css_put(&memcg->css);
> > > > >>>>         return ret;
> > > > >>>>
> > > > >>>>
> > > > >>>> AMD 16K+32K THP=always
> > > > >>>> metric        mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > > > >>>> real          1m23.038s     1m23.050s                                  1m22.704s
> > > > >>>> user          53m57.210s    53m53.437s                                 53m52.577s
> > > > >>>> sys           7m24.592s     7m48.843s                                  7m22.519s
> > > > >>>> zswpin        612070        999244                                     815934
> > > > >>>> zswpout       2226403       2347979                                    2054980
> > > > >>>> pgfault       20667366      20481728                                   20478690
> > > > >>>> pgmajfault    385887        269117                                     309702
> > > > >>>>
> > > > >>>> AMD 16K+32K+64K THP=always
> > > > >>>> metric        mm-unstable   mm-unstable + large folio zswapin series   mm-unstable + large folio zswapin + no swap thrashing fix
> > > > >>>> real
> > > > >>>>               1m22.975s     1m23.266s                                  1m22.549s
> > > > >>>> user          53m51.302s    53m51.069s                                 53m46.471s
> > > > >>>> sys           7m40.168s     7m57.104s                                  7m25.012s
> > > > >>>> zswpin        676492        1258573                                    1225703
> > > > >>>> zswpout       2449839       2714767                                    2899178
> > > > >>>> pgfault       17540746      17296555                                   17234663
> > > > >>>> pgmajfault    429629        307495                                     287859
> > > > >>>>
> > > > >>>
> > > > >>> Thanks Usama and Barry for looking into this. It seems like this would
> > > > >>> fix a regression with large folio swapin regardless of zswap. Can the
> > > > >>> same result be reproduced on zram without this series?
> > > > >>
> > > > >>
> > > > >> Yes, it's a regression in large folio swapin support regardless of zswap/zram.
> > > > >>
> > > > >> Need to do 3 tests, one with probably the below diff to remove large folio support,
> > > > >> one with current upstream and one with upstream + swap thrashing fix.
> > > > >>
> > > > >> We only use zswap and don't have a zram setup (and I am a bit lazy to create one :)).
> > > > >> Any zram volunteers to try this?
> > > > >
> > > > > Hi Usama,
> > > > >
> > > > > I tried a quick experiment:
> > > > >
> > > > > echo 1 > /sys/module/zswap/parameters/enabled
> > > > > echo 0 > /sys/module/zswap/parameters/enabled
> > > > >
> > > > > This was to test the zRAM scenario. Enabling zswap even
> > > > > once disables mTHP swap-in. :)
> > > > >
> > > > > I noticed a similar regression with zRAM alone, but the change resolved
> > > > > the issue and even sped up the kernel build compared to the setup without
> > > > > mTHP swap-in.
> > > >
> > > > Thanks for trying, this is amazing!
> > > >
> > > > > However, I'm still working on a proper patch to address this. The current
> > > > > approach:
> > > > >
> > > > > mem_cgroup_margin(memcg) < max(MEMCG_CHARGE_BATCH, folio_nr_pages(folio))
> > > > >
> > > > > isn't sufficient, as it doesn't cover cases where group A contains group B, and
> > > > > we're operating within group B.
> > > > > The problem occurs not at the boundary of
> > > > > group B but at the boundary of group A.
> > > >
> > > > I am not sure I completely followed this. As MEMCG_CHARGE_BATCH=64, if we are
> > > > trying to swapin a 16kB page, we basically check if at least 64/4 = 16 folios can be
> > > > charged to the cgroup, which is reasonable. If we try to swapin a 1M folio, we just
> > > > check if we can charge at least 1 folio. Are you saying that checking just 1 folio
> > > > is not enough in this case and can still cause thrashing, i.e. we should check more?
> > >
> > > My understanding is that cgroups are hierarchical. Even if we don't
> > > hit the memory limit of the folio's direct memcg, we could still
> > > reach the limit of one of its parent memcgs. Imagine a structure like:
> > >
> > > /sys/fs/cgroup/a/b/c/d
> > >
> > > If we're compiling the kernel in d, there's a chance that while d
> > > isn't at its limit, its parents (c, b, or a) could be. Currently,
> > > the check only applies to d.
> >
> > To clarify, I mean something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 17af08367c68..cc6d21848ee8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4530,6 +4530,29 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >         return 0;
> >  }
> >
> > +/*
> > + * When the memory cgroup is nearly full, swapping in large folios can
> > + * easily lead to swap thrashing, as the memcg operates on the edge of
> > + * being full. We maintain a margin to allow for quick fallback to
> > + * smaller folios during the swap-in process.
> > + */
> > +static inline bool mem_cgroup_swapin_margin_protected(struct mem_cgroup *memcg,
> > +                                                      struct folio *folio)
> > +{
> > +       unsigned int nr;
> > +
> > +       if (!folio_test_large(folio))
> > +               return false;
> > +
> > +       nr = max_t(unsigned int, folio_nr_pages(folio), MEMCG_CHARGE_BATCH);
> > +       for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > +               if (mem_cgroup_margin(memcg) < nr)
> > +                       return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> >  /**
> >   * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
> >   * @folio: folio to charge.
> > @@ -4547,7 +4570,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >  {
> >         struct mem_cgroup *memcg;
> >         unsigned short id;
> > -       int ret;
> > +       int ret = -ENOMEM;
> > +       bool margin_prot;
> >
> >         if (mem_cgroup_disabled())
> >                 return 0;
> > @@ -4557,9 +4581,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >         memcg = mem_cgroup_from_id(id);
> >         if (!memcg || !css_tryget_online(&memcg->css))
> >                 memcg = get_mem_cgroup_from_mm(mm);
> > +       margin_prot = mem_cgroup_swapin_margin_protected(memcg, folio);
> >         rcu_read_unlock();
> >
> > -       ret = charge_memcg(folio, memcg, gfp);
> > +       if (!margin_prot)
> > +               ret = charge_memcg(folio, memcg, gfp);
> >
> >         css_put(&memcg->css);
> >         return ret;
>
> I'm not quite following.
>
> The charging code DOES the margin check. If you just want to avoid
> reclaim, pass gfp without __GFP_DIRECT_RECLAIM, and it will return
> -ENOMEM if there is no margin.
>
> alloc_swap_folio() passes the THP mask, which should not include the
> reclaim flag per default (GFP_TRANSHUGE_LIGHT). Unless you run with
> defrag=always. Is that what's going on?

No, quite sure "defrag=never" can just achieve the same result.
Imagine we only have small folios: each time reclamation occurs, we have
at least a SWAP_CLUSTER_MAX buffer before the next reclamation is
triggered.

        .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

However, with large folios, we can quickly exhaust the SWAP_CLUSTER_MAX
buffer and reach the next reclamation point. Once we consume
SWAP_CLUSTER_MAX - 1, the mem_cgroup_swapin_charge_folio() call for the
final small folio with GFP_KERNEL will trigger reclamation.

        if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
                                           GFP_KERNEL, entry)) {

Thanks
Barry
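The buffer-exhaustion effect described above can be sketched with a toy model (a hypothetical Python illustration, not kernel code; the helper name is made up, and it assumes SWAP_CLUSTER_MAX is 32 pages as in current kernels and that each reclaim pass frees max(nr_pages, SWAP_CLUSTER_MAX) pages of margin):

```python
SWAP_CLUSTER_MAX = 32  # pages freed per reclaim pass (kernel default)

def faults_until_next_reclaim(folio_pages: int) -> int:
    """Toy model: one reclaim pass restores max(folio_pages, SWAP_CLUSTER_MAX)
    pages of memcg margin; count how many swap-in faults of this folio size
    fit into that margin before the memcg is back at its limit and the next
    charge must reclaim again."""
    margin = max(folio_pages, SWAP_CLUSTER_MAX)
    faults = 0
    while margin >= folio_pages:
        margin -= folio_pages
        faults += 1
    return faults

# 4 KiB (1-page) folios get 32 faults between reclaim passes;
# 64 KiB (16-page) folios get only 2 before reclaim fires again.
for pages in (1, 4, 16):
    print(pages, faults_until_next_reclaim(pages))
```

Under these assumptions, large folios drain the post-reclaim margin an order of magnitude faster, which is the thrashing cycle the thread is discussing.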