From: Barry Song <21cnbao@gmail.com>
To: 21cnbao@gmail.com, usamaarif642@gmail.com
Cc: hannes@cmpxchg.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org, shakeel.butt@linux.dev, yosryahmed@google.com, Barry Song
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
Date: Sun, 12 Jan 2025 23:49:26 +1300
Message-Id: <20250112104926.54924-1-21cnbao@gmail.com>
On Fri, Jan 10, 2025 at 11:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Usama,
>
> Please include me in the discussion. I'll try to attend, at least remotely.
>
> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif wrote:
> >
> > I would like to propose a session to discuss the work going on
> > around large folio swapin, whether it's traditional swap, zswap,
> > or zram.
> >
> > Large folios have obvious advantages that have been discussed before,
> > like fewer page faults, batched PTE and rmap manipulation, shorter
> > LRU lists, and TLB coalescing (for arm64 and amd).
> > However, swapping in large folios has its own drawbacks, like higher
> > swap thrashing. I had initially sent an RFC for zswapin of large
> > folios in [1], but it causes a regression in kernel build time due
> > to swap thrashing, which I am confident is happening with zram large
> > folio swapin as well (which is merged in the kernel).
> >
> > Some of the points we could discuss in the session:
> >
> > - What is the right (preferably open source) benchmark to test for
> > swapin of large folios? Kernel build time in a limited-memory cgroup
> > shows a regression, while microbenchmarks show a massive improvement;
> > maybe there are benchmarks where TLB misses are a big factor and
> > show an improvement.
>
> My understanding is that it largely depends on the workload. In interactive
> scenarios, such as on a phone, swap thrashing is not an issue because
> there is minimal to no thrashing for the app occupying the screen
> (foreground). In such cases, swap bandwidth becomes the most critical factor
> in improving app switching speed, especially when multiple applications
> are switching between background and foreground states.
>
> > - We could have something like
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> > to enable/disable swapin, but it's going to be difficult to tune: it
> > might have different optimum values based on workloads and is likely
> > to be left at its default value. Is there some dynamic way to decide
> > when to swap in large folios and when to fall back to smaller folios?
> > The swapin_readahead swapcache path, which only supports 4K folios atm,
> > has a readahead window based on hits; however, readahead is a folio flag
> > and not a page flag, so this method can't be used: once a large folio
> > is swapped in, we won't get a fault, and subsequent hits on other
> > pages of the large folio won't be recorded.
> >
> > - For zswap and zram, it might be that doing larger block compression/
> > decompression offsets the regression from swap thrashing, but it
> > brings its own issues. For example, once a large folio is swapped
> > out, it could fail to swap in as a large folio and fall back
> > to 4K, resulting in redundant decompressions.
>
> That's correct. My current workaround involves swapping four small folios,
> and zsmalloc will compress and decompress in chunks of four pages,
> regardless of the actual size of the mTHP. The improvement in compression
> ratio and speed becomes less significant after exceeding four pages, even
> though there is still some increase.
>
> Our recent experiments on phones also show that enabling direct reclamation
> for do_swap_page() to allocate 2-order mTHP results in a 0% allocation
> failure rate; this probably removes the need for falling back to four
> small folios. (Note that our experiments include Yu's TAO, which Android
> GKI has already merged. However, since 2 is less than
> PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even
> without Yu's TAO, although I have not confirmed this.)
>
> > This will also mean swapin of large folios from traditional swap
> > isn't something we should proceed with?
> >
> > - Should we even support large folio swapin? You often have high swap
> > activity when the system/cgroup is close to running out of memory; at
> > this point, maybe the best way forward is to just swap in 4K pages and
> > let khugepaged [2], [3] collapse them if the surrounding pages are
> > swapped in as well.
>
> This approach might be suitable for non-interactive scenarios, such as building
> a kernel within a memory control group (memcg) or running other server
> applications. However, performing collapse in interactive and power-sensitive
> scenarios would be unnecessary and could lead to wasted power due to
> memory migration and unmap/map operations.
>
> However, it is quite challenging to automatically determine the type of
> workloads the system is running. I feel we still need a global control to
> decide whether to enable mTHP swap-in, not necessarily per size, but at
> least at a global level. That said, there is evident resistance to
> introducing additional controls to enable or disable mTHP features.

I drafted an approach that eliminates the need for this control. Based on
my testing, it results in even less swap thrashing than disabling mTHP
swap-in for the non-mglru case. Here are the results:

real	6m27.227s
user	49m46.751s
sys	3m34.512s
pswpin: 294050
pswpout: 1265556
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 288163
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 22899
pgpgin: 11816316
pgpgout: 13891256
swpout_zero: 136907
swpin_zero: 77215

The draft is as below,

[PATCH RFC] mm: throttle large folios swap-in based on thrashing

We have two types of workloads. The first is interactive systems, where the
foreground desktop apps typically do not swap out. In this case, we are more
concerned with swap bandwidth for switching background and foreground apps,
which is primarily driven by large folio swap-ins. The second type involves
scenarios like building a kernel in a 1GB memory cgroup, where extensive
swapping occurs. Large folio swap-ins can exacerbate swap thrashing in such
cases.

While conceptually we could use a sysfs control to toggle the mTHP
swap-in feature, there is resistance to adding new controls.
Instead, we employ a simple automatic mechanism to roughly detect swap
thrashing: if refaults are observed in a recent batch of swap-ins, we
fall back to small folio swap-ins.

Even during a kernel build in a 1GiB memory cgroup, we continue to observe
many large folio swap-ins, benefiting from increased swap-in bandwidth,
while the increased swap thrashing has been eliminated compared to
disabling mTHP swap-in.

Signed-off-by: Barry Song
---
 include/linux/mmzone.h |  9 +++++++++
 mm/memcontrol.c        | 19 +++++++++++++++++--
 mm/workingset.c        | 37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da..c6deece243d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -329,6 +329,15 @@ enum lruvec_flags {
 	LRUVEC_NODE_CONGESTED,
 };

+/*
+ * Has the lruvec experienced an anon large folio refault recently?
+ * Once a refault occurs, we set it to 31; it only degrades to 0 if
+ * there are more than 31 consecutive non-refault swap-ins.
+ */
+#define LRUVEC_REFAULT_WIDTH	5
+#define LRUVEC_REFAULT_OFFS	(LRUVEC_NODE_CONGESTED + 1)
+#define LRUVEC_REFAULT_MASK	((BIT(LRUVEC_REFAULT_WIDTH) - 1) << LRUVEC_REFAULT_OFFS)
+
 #endif /* !__GENERATING_BOUNDS_H */

 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..4155c4126a80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4556,12 +4556,21 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 		gfp_t gfp, swp_entry_t entry)
 {
+	struct pglist_data *pgdat = folio_pgdat(folio);
+	struct lruvec *lruvec;
 	struct mem_cgroup *memcg;
 	unsigned short id;
 	int ret;

-	if (mem_cgroup_disabled())
-		return 0;
+	if (mem_cgroup_disabled()) {
+		/*
+		 * lruvec is congested or has recent THP refaults,
+		 * avoid future swap thrashing
+		 */
+		lruvec = &pgdat->__lruvec;
+		return (folio_test_large(folio) && lruvec->flags) ?
+			-ENOMEM : 0;
+	}

 	id = lookup_swap_cgroup_id(entry);
 	rcu_read_lock();
@@ -4570,8 +4579,14 @@
 	memcg = get_mem_cgroup_from_mm(mm);
 	rcu_read_unlock();

+	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
+	if (folio_test_large(folio) && lruvec->flags) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	ret = charge_memcg(folio, memcg, gfp);
+out:
 	css_put(&memcg->css);
 	return ret;
 }
diff --git a/mm/workingset.c b/mm/workingset.c
index 4841ae8af411..095f8668dc22 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -280,6 +280,28 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
 	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
 }

+static void lruvec_set_max_refaults(struct lruvec *lruvec)
+{
+	set_mask_bits(&lruvec->flags, LRUVEC_REFAULT_MASK, LRUVEC_REFAULT_MASK);
+}
+
+static int lruvec_dec_refaults(struct lruvec *lruvec)
+{
+	unsigned long new_flags, old_flags = READ_ONCE(lruvec->flags);
+	unsigned long new_ref, old_ref;
+
+	do {
+		old_ref = (old_flags & LRUVEC_REFAULT_MASK) >> LRUVEC_REFAULT_OFFS;
+		if (old_ref == 0)
+			return 0;
+		new_ref = old_ref - 1;
+		new_flags = old_flags & ~LRUVEC_REFAULT_MASK;
+		new_flags |= new_ref << LRUVEC_REFAULT_OFFS;
+	} while (!try_cmpxchg(&lruvec->flags, &old_flags, new_flags));
+
+	return old_ref;
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	bool recent;
@@ -299,8 +321,14 @@ static void lru_gen_refault(struct folio *folio, void *shadow)

 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);

-	if (!recent)
+	if (!recent) {
+		if (!type)
+			lruvec_dec_refaults(lruvec);
 		goto unlock;
+	}
+
+	if (!type && folio_test_large(folio))
+		lruvec_set_max_refaults(lruvec);

 	lrugen = &lruvec->lrugen;
@@ -563,11 +591,16 @@ void workingset_refault(struct folio *folio, void *shadow)

 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

-	if (!workingset_test_recent(shadow, file, &workingset, true))
+	if (!workingset_test_recent(shadow, file, &workingset, true)) {
+		if (!file)
+			lruvec_dec_refaults(lruvec);
 		return;
+	}

 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
+	if (!file && folio_test_large(folio))
+		lruvec_set_max_refaults(lruvec);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);

 	/* Folio was active prior to eviction */
--
2.34.1

>
> By the way, Usama, have you ever tried switching between mglru and the
> traditional active/inactive LRU? My experience shows a significant
> difference in swap thrashing: the active/inactive LRU exhibits much less
> swap thrashing in my local kernel build tests.
>
> The latest mm-unstable:
>
> *********** default mglru: ***********
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real	6m44.561s
> user	46m53.274s
> sys	3m48.585s
> pswpin: 1286081
> pswpout: 3147936
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 714580
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 286881
> pgpgin: 17199072
> pgpgout: 21493892
> swpout_zero: 229163
> swpin_zero: 84353
>
> ******** disable mglru ********
>
> root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
>
> root@barry-desktop:/home/barry/develop/linux# ./build.sh
> *** Executing round 1 ***
> real	6m27.944s
> user	46m41.832s
> sys	3m30.635s
> pswpin: 474036
> pswpout: 1434853
> 64kB-swpout: 0
> 32kB-swpout: 0
> 16kB-swpout: 331755
> 64kB-swpin: 0
> 32kB-swpin: 0
> 16kB-swpin: 106333
> pgpgin: 11763720
> pgpgout: 14551524
> swpout_zero: 145050
> swpin_zero: 87981
>
> my build script:
>
> #!/bin/bash
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>
> vmstat_path="/proc/vmstat"
> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>
> read_values() {
>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
> }
>
> for ((i=1; i<=1; i++))
> do
>   echo
>   echo "*** Executing round $i ***"
>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>   echo 3 > /proc/sys/vm/drop_caches
>
>   #kernel build
>   initial_values=($(read_values))
>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>   final_values=($(read_values))
>
>   echo "pswpin: $((final_values[0] - initial_values[0]))"
>   echo "pswpout: $((final_values[1] - initial_values[1]))"
>   echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
>   echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
>   echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
>   echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
>   echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
>   echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
>   echo "pgpgin: $((final_values[8] - initial_values[8]))"
>   echo "pgpgout: $((final_values[9] - initial_values[9]))"
>   echo "swpout_zero: $((final_values[10] - initial_values[10]))"
>   echo "swpin_zero: $((final_values[11] - initial_values[11]))"
>   sync
>   sleep 10
> done
>
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> > [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> > [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Thanks,
> > Usama
>
> Thanks
> Barry