References: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>
In-Reply-To: <58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com>
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 10 Jan 2025 23:09:52 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
To: Usama Arif
Cc: lsf-pc@lists.linux-foundation.org, Linux Memory Management List, Johannes Weiner, Yosry Ahmed, Shakeel Butt
Hi Usama,

Please include me in the discussion. I'll try to attend, at least remotely.

On Fri, Jan 10, 2025 at 9:06 AM Usama Arif wrote:
>
> I would like to propose a session to discuss the work going on
> around large folio swapin, whether it's traditional swap, zswap,
> or zram.
>
> Large folios have obvious advantages that have been discussed before,
> like fewer page faults, batched PTE and rmap manipulation, shorter
> LRU lists, and TLB coalescing (on arm64 and AMD).
> However, swapping in large folios has its own drawbacks, like higher
> swap thrashing.
> I had initially sent an RFC for zswapin of large folios in [1],
> but it causes a regression in kernel build time due to swap
> thrashing, which I am confident is happening with zram large folio
> swapin as well (which is merged in the kernel).
>
> Some of the points we could discuss in the session:
>
> - What is the right (preferably open source) benchmark to test
> swapin of large folios? Kernel build time in a limited-memory
> cgroup shows a regression, while microbenchmarks show a massive
> improvement; maybe there are benchmarks where TLB misses are a big
> factor and show an improvement.

My understanding is that it largely depends on the workload. In
interactive scenarios, such as on a phone, swap thrashing is not an
issue because there is minimal to no thrashing for the app occupying
the screen (foreground). In such cases, swap bandwidth becomes the most
critical factor in improving app-switching speed, especially when
multiple applications are switching between background and foreground
states.

> - We could have something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
> to enable/disable swapin, but it's going to be difficult to tune; it
> might have different optimum values based on workloads and is likely
> to be left at its default value. Is there some dynamic way to decide
> when to swap in large folios and when to fall back to smaller folios?
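For what it's worth, if such per-size swapin_enabled knobs were added,
toggling them from userspace could look like the sketch below. To be
clear, this is hypothetical: only the per-size `enabled` knobs exist in
mainline today, and `set_swapin` and `THP_ROOT` are made-up names for
illustration. The directory layout mirrors the existing hugepages-*kB
sysfs tree.

```shell
#!/bin/bash
# Sketch only: swapin_enabled is the *proposed* per-size knob discussed
# above; it does not exist in mainline kernels. THP_ROOT is overridable
# so the sketch can be exercised against a scratch directory.
THP_ROOT="${THP_ROOT:-/sys/kernel/mm/transparent_hugepage}"

set_swapin() {
    # set_swapin <size-in-kB> <always|never>
    local knob="$THP_ROOT/hugepages-${1}kB/swapin_enabled"
    if [ -e "$knob" ]; then
        echo "$2" > "$knob"
    else
        echo "no such knob: $knob" >&2
        return 1
    fi
}
```

e.g. `set_swapin 16 always` would then enable 16kB swapin while leaving
the other sizes alone, which is exactly the kind of per-size tuning that
is hard to get right, as noted above.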
> The swapin_readahead swapcache path, which only supports 4K folios at
> the moment, has a readahead window based on hits; however, readahead
> is a folio flag and not a page flag, so this method can't be used:
> once a large folio is swapped in, we won't get a fault, and subsequent
> hits on other pages of the large folio won't be recorded.
>
> - For zswap and zram, it might be that doing larger-block compression/
> decompression offsets the regression from swap thrashing, but it
> brings its own issues. For example, once a large folio is swapped
> out, it could fail to swap in as a large folio and fall back to 4K,
> resulting in redundant decompressions.

That's correct. My current workaround involves swapping in four small
folios, and zsmalloc will compress and decompress in chunks of four
pages, regardless of the actual size of the mTHP. The improvement in
compression ratio and speed becomes less significant beyond four pages,
even though there is still some increase.

Our recent experiments on phones also show that enabling direct
reclamation for do_swap_page() to allocate order-2 mTHPs results in a
0% allocation failure rate; this probably removes the need to fall back
to four small folios. (Note that our experiments include Yu's TAO;
Android GKI has already merged it. However, since 2 is less than
PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results even without
Yu's TAO, although I have not confirmed this.)

> This will also mean swapin of large folios from traditional swap
> isn't something we should proceed with?
>
> - Should we even support large folio swapin? You often have high swap
> activity when the system/cgroup is close to running out of memory; at
> this point, maybe the best way forward is to just swap in 4K pages
> and let khugepaged [2], [3] collapse them if the surrounding pages
> are swapped in as well.
This approach might be suitable for non-interactive scenarios, such as
building a kernel within a memory control group (memcg) or running
other server applications. However, performing collapse in interactive
and power-sensitive scenarios would be unnecessary and could lead to
wasted power due to memory migration and unmap/map operations.

That said, it is quite challenging to automatically determine the type
of workload the system is running. I feel we still need a global
control to decide whether to enable mTHP swap-in: not necessarily per
size, but at least at a global level. However, there is evident
resistance to introducing additional controls to enable or disable
mTHP features.

By the way, Usama, have you ever tried switching between MGLRU and the
traditional active/inactive LRU? My experience shows a significant
difference in swap thrashing: the active/inactive LRU exhibits much
less swap thrashing in my local kernel build tests.

the latest mm-unstable

***********
default mglru:
***********
root@barry-desktop:/home/barry/develop/linux# ./build.sh

*** Executing round 1 ***
real    6m44.561s
user    46m53.274s
sys     3m48.585s
pswpin: 1286081
pswpout: 3147936
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 714580
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 286881
pgpgin: 17199072
pgpgout: 21493892
swpout_zero: 229163
swpin_zero: 84353

********
disable mglru
********
root@barry-desktop:/home/barry/develop/linux# echo 0 > /sys/kernel/mm/lru_gen/enabled
root@barry-desktop:/home/barry/develop/linux# ./build.sh

*** Executing round 1 ***
real    6m27.944s
user    46m41.832s
sys     3m30.635s
pswpin: 474036
pswpout: 1434853
64kB-swpout: 0
32kB-swpout: 0
16kB-swpout: 331755
64kB-swpin: 0
32kB-swpin: 0
16kB-swpin: 106333
pgpgin: 11763720
pgpgout: 14551524
swpout_zero: 145050
swpin_zero: 87981

my build script:

#!/bin/bash

echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

vmstat_path="/proc/vmstat"
thp_base_path="/sys/kernel/mm/transparent_hugepage"

read_values() {
    pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
    pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
    pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
    pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
    swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
    swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
    swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
    swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
    swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
    swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
    swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
    swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
    echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
}

for ((i=1; i<=1; i++))
do
    echo
    echo "*** Executing round $i ***"
    make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
    echo 3 > /proc/sys/vm/drop_caches

    #kernel build
    initial_values=($(read_values))
    time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
        CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
    final_values=($(read_values))

    echo "pswpin: $((final_values[0] - initial_values[0]))"
    echo "pswpout: $((final_values[1] - initial_values[1]))"
    echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
    echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
    echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
    echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
    echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
    echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
    echo "pgpgin: $((final_values[8] - initial_values[8]))"
    echo "pgpgout: $((final_values[9] - initial_values[9]))"
    echo "swpout_zero: $((final_values[10] - initial_values[10]))"
    echo "swpin_zero: $((final_values[11] - initial_values[11]))"
    sync
    sleep 10
done

>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>
> Thanks,
> Usama

Thanks
Barry
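P.S. For anyone reproducing the comparison, the dozen hand-written
delta lines in the build script can be folded into a small helper. This
is just a convenience sketch: `delta_vmstat` is a made-up name, and it
only covers counters in /proc/vmstat's "name value" format (the
per-size sysfs stats would still need the `cat ... || echo 0` handling
from the script).

```shell
# Sketch: diff two /proc/vmstat-style snapshots and print per-counter
# deltas. Missing counters are treated as 0.
delta_vmstat() {
    # delta_vmstat <before-file> <after-file> <counter>...
    local before="$1" after="$2" name b a
    shift 2
    for name in "$@"; do
        b=$(awk -v k="$name" '$1 == k { print $2 }' "$before")
        a=$(awk -v k="$name" '$1 == k { print $2 }' "$after")
        echo "$name: $(( ${a:-0} - ${b:-0} ))"
    done
}
```

Saving `cat /proc/vmstat` to before/after files around the build, then
running `delta_vmstat before after pswpin pswpout pgpgin pgpgout`,
prints the same per-counter deltas the script computes by hand.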