From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 4 Sep 2024 02:24:39 +0800
Subject: Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io devices
To: hanchuanhua@oppo.com, Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, baolin.wang@linux.alibaba.com,
    chrisl@kernel.org, david@redhat.com, hannes@cmpxchg.org, hughd@google.com,
    kaleshsingh@google.com, linux-kernel@vger.kernel.org, mhocko@suse.com,
    minchan@kernel.org, nphamcs@gmail.com, ryan.roberts@arm.com,
    senozhatsky@chromium.org, shakeel.butt@linux.dev, shy828301@gmail.com,
    surenb@google.com, v-songbaohua@oppo.com, willy@infradead.org,
    xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com, hch@infradead.org
In-Reply-To: <20240821074541.516249-3-hanchuanhua@oppo.com>
References: <20240821074541.516249-1-hanchuanhua@oppo.com> <20240821074541.516249-3-hanchuanhua@oppo.com>

Hi All,

On Wed, Aug 21, 2024 at 3:46 PM wrote:
>
> From: Chuanhua Han
>
> Currently, we have mTHP features, but unfortunately, without support for
> large folio swap-ins, once these large folios are swapped out, they are
> lost because mTHP swap is a one-way process. The lack of mTHP swap-in
> functionality prevents mTHP from being used on devices like Android that
> heavily rely on swap.
>
> This patch introduces mTHP swap-in support. It starts from sync devices
> such as zRAM. This is probably the simplest and most common use case,
> benefiting billions of Android phones and similar devices with minimal
> implementation cost. In this straightforward scenario, large folios are
> always exclusive, eliminating the need to handle complex rmap and
> swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>    swap-out and swap-in. Large folios in the buddy system are also
>    preserved as much as possible, rather than being fragmented due
>    to swap-in.
>
> 2. Eliminates fragmentation in swap slots and supports successful
>    THP_SWPOUT.
>
>    w/o this patch (Refer to the data from Chris's and Kairui's latest
>    swap allocator optimization while running ./thp_swap_allocator_test
>    w/o "-a" option [1]):
>
>    ./thp_swap_allocator_test
>    Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
>    Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
>    Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
>    Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
>    Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
>    Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
>    Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
>    Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
>    Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
>    Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
>    Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
>    Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
>    Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
>    Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
>    Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
>    Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
>
>    w/ this patch (always 0%):
>    Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
>    Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
>    ...
>
> 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
>    zsmalloc compression/decompression with larger granularity[2]. The upcoming
>    optimization in zsmalloc will significantly increase swap speed and improve
>    compression efficiency. Tested by running 100 iterations of swapping 100MiB
>    of anon memory, the swap speed improved dramatically:
>
>                 time consumption of swapin(ms)   time consumption of swapout(ms)
>    lz4 4k       45274                            90540
>    lz4 64k      22942                            55667
>    zstdn 4k     85035                            186585
>    zstdn 64k    46558                            118533
>
>    The compression ratio also improved, as evaluated with 1 GiB of data:
>
>    granularity     orig_data_size    compr_data_size
>    4KiB-zstd       1048576000        246876055
>    64KiB-zstd      1048576000        199763892
>
>    Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
>    realized.
>
> 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
>    of nr_pages. Swapping in content filled with the same data 0x11, w/o
>    and w/ the patch for five rounds (Since the content is the same,
>    decompression will be very fast. This primarily assesses the impact of
>    reduced page faults):
>
>    swp in bandwidth(bytes/ms)    w/o       w/
>    round1                        624152    1127501
>    round2                        631672    1127501
>    round3                        620459    1139756
>    round4                        606113    1139756
>    round5                        624152    1152281
>    avg                           621310    1137359   +83%
>
> [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
> [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
>
> Signed-off-by: Chuanhua Han
> Co-developed-by: Barry Song
> Signed-off-by: Barry Song
> ---
>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 223 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b9fe2f354878..7aa0358a4160 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3986,6 +3986,184 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	return VM_FAULT_SIGBUS;
>  }
>
> +static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct folio *folio;
> +	swp_entry_t entry;
> +
> +	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
> +				vmf->address, false);
> +	if (!folio)
> +		return NULL;
> +
> +	entry = pte_to_swp_entry(vmf->orig_pte);
> +	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +					   GFP_KERNEL, entry)) {
> +		folio_put(folio);
> +		return NULL;
> +	}
> +
> +	return folio;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Check if the PTEs within a range are contiguous swap entries
> + * and have no cache when check_no_cache is true.
> + */
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +			   int nr_pages, bool check_no_cache)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long addr;
> +	swp_entry_t entry;
> +	pgoff_t offset;
> +	int idx, i;
> +	pte_t pte;
> +
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +	idx = (vmf->address - addr) / PAGE_SIZE;
> +	pte = ptep_get(ptep);
> +
> +	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> +		return false;
> +	entry = pte_to_swp_entry(pte);
> +	offset = swp_offset(entry);
> +	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> +		return false;
> +
> +	if (!check_no_cache)
> +		return true;
> +
> +	si = swp_swap_info(entry);
> +	/*
> +	 * While allocating a large folio and doing swap_read_folio, which is
> +	 * the case the being faulted pte doesn't have swapcache. We need to
> +	 * ensure all PTEs have no cache as well, otherwise, we might go to
> +	 * swap devices while the content is in swapcache.
> +	 */
> +	for (i = 0; i < nr_pages; i++) {
> +		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> +						      unsigned long addr,
> +						      unsigned long orders)
> +{
> +	int order, nr;
> +
> +	order = highest_order(orders);
> +
> +	/*
> +	 * To swap in a THP with nr pages, we require that its first swap_offset
> +	 * is aligned with that number, as it was when the THP was swapped out.
> +	 * This helps filter out most invalid entries.
> +	 */
> +	while (orders) {
> +		nr = 1 << order;
> +		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	return orders;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long orders;
> +	struct folio *folio;
> +	unsigned long addr;
> +	swp_entry_t entry;
> +	spinlock_t *ptl;
> +	pte_t *pte;
> +	gfp_t gfp;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (unlikely(userfaultfd_armed(vma)))
> +		goto fallback;
> +
> +	/*
> +	 * A large swapped out folio could be partially or fully in zswap. We
> +	 * lack handling for such cases, so fallback to swapping in order-0
> +	 * folio.
> +	 */
> +	if (!zswap_never_enabled())
> +		goto fallback;
> +
> +	entry = pte_to_swp_entry(vmf->orig_pte);
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * and suitable for swapping THP.
> +	 */
> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +	orders = thp_swap_suitable_orders(swp_offset(entry),
> +					  vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +				  vmf->address & PMD_MASK, &ptl);
> +	if (unlikely(!pte))
> +		goto fallback;
> +
> +	/*
> +	 * For do_swap_page, find the highest order where the aligned range is
> +	 * completely swap entries with contiguous swap offsets.
> +	 */
> +	order = highest_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order, true))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	/* Try allocating the highest of the remaining orders. */
> +	gfp = vma_thp_gfp_mask(vma);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +		if (folio) {
> +			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +							    gfp, entry))
> +				return folio;
> +			folio_put(folio);
> +		}
> +		order = next_order(&orders, order);
> +	}
> +
> +fallback:
> +	return __alloc_swap_folio(vmf);
> +}
> +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> +				  int nr_pages, bool check_no_cache)
> +{
> +	return false;
> +}
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	return __alloc_swap_folio(vmf);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4074,34 +4252,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/*
> -			 * Prevent parallel swapin from proceeding with
> -			 * the cache flag. Otherwise, another thread may
> -			 * finish swapin first, free the entry, and swapout
> -			 * reusing the same entry. It's undetectable as
> -			 * pte_same() returns true due to entry reuse.
> -			 */
> -			if (swapcache_prepare(entry, 1)) {
> -				/* Relax a bit to prevent rapid repeated page faults */
> -				schedule_timeout_uninterruptible(1);
> -				goto out;
> -			}
> -			need_clear_cache = true;
> -
>  			/* skip swapcache */
> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -						vma, vmf->address, false);
> +			folio = alloc_swap_folio(vmf);
>  			if (folio) {
>  				__folio_set_locked(folio);
>  				__folio_set_swapbacked(folio);
>
> -				if (mem_cgroup_swapin_charge_folio(folio,
> -						vma->vm_mm, GFP_KERNEL,
> -						entry)) {
> -					ret = VM_FAULT_OOM;
> +				nr_pages = folio_nr_pages(folio);
> +				if (folio_test_large(folio))
> +					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +				/*
> +				 * Prevent parallel swapin from proceeding with
> +				 * the cache flag. Otherwise, another thread
> +				 * may finish swapin first, free the entry, and
> +				 * swapout reusing the same entry. It's
> +				 * undetectable as pte_same() returns true due
> +				 * to entry reuse.
> +				 */
> +				if (swapcache_prepare(entry, nr_pages)) {
> +					/*
> +					 * Relax a bit to prevent rapid
> +					 * repeated page faults.
> +					 */
> +					schedule_timeout_uninterruptible(1);
>  					goto out_page;
>  				}
> -				mem_cgroup_swapin_uncharge_swap(entry, 1);
> +				need_clear_cache = true;
> +
> +				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
>
>  				shadow = get_shadow_from_swap_cache(entry);
>  				if (shadow)
> @@ -4207,6 +4385,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		goto out_nomap;
>  	}
>
> +	/* allocated large folios for SWP_SYNCHRONOUS_IO */
> +	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
> +		unsigned long nr = folio_nr_pages(folio);
> +		unsigned long folio_start = ALIGN_DOWN(vmf->address,
> +						       nr * PAGE_SIZE);
> +		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
> +		pte_t *folio_ptep = vmf->pte - idx;
> +
> +		if (!can_swapin_thp(vmf, folio_ptep, nr, false))
> +			goto out_nomap;
> +
> +		page_idx = idx;
> +		address = folio_start;
> +		ptep = folio_ptep;
> +		goto check_folio;
> +	}
> +
>  	nr_pages = 1;
>  	page_idx = 0;
>  	address = vmf->address;
> @@ -4338,11 +4533,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_add_lru_vma(folio, vma);
>  	} else if (!folio_test_anon(folio)) {
>  		/*
> -		 * We currently only expect small !anon folios, which are either
> -		 * fully exclusive or fully shared. If we ever get large folios
> -		 * here, we have to be careful.
> +		 * We currently only expect small !anon folios which are either
> +		 * fully exclusive or fully shared, or new allocated large
> +		 * folios which are fully exclusive. If we ever get large
> +		 * folios within swapcache here, we have to be careful.
>  		 */
> -		VM_WARN_ON_ONCE(folio_test_large(folio));
> +		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
>  		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>  		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>  	} else {
> @@ -4385,7 +4581,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  out:
>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry, 1);
> +		swapcache_clear(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4401,7 +4597,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_put(swapcache);
>  	}
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry, 1);
> +		swapcache_clear(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> --
> 2.43.0
>

With the latest mm-unstable, I'm seeing the following WARN followed by
user space segfaults (with multiple mTHP sizes enabled):

[ 39.145686] ------------[ cut here ]------------
[ 39.145969] WARNING: CPU: 24 PID: 11159 at mm/page_io.c:535 swap_read_folio+0x4db/0x520
[ 39.146307] Modules linked in:
[ 39.146507] CPU: 24 UID: 1000 PID: 11159 Comm: sh Kdump: loaded Not tainted 6.11.0-rc6.orig+ #131
[ 39.146887] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[ 39.147206] RIP: 0010:swap_read_folio+0x4db/0x520
[ 39.147430] Code: 00 e0 ff ff 09 c1 83 f8 08 0f 42 d1 e9 c4 fe ff ff 48 63 85 34 02 00 00 48 03 45 08 49 39 c4 0f 85 63 fe ff ff e9 db fe ff ff <0f> 0b e9 91 fd ff ff 31 d2 e9 9d fe ff ff 48 c7 c6 38 b6 4e 82 48
[ 39.148079] RSP: 0000:ffffc900045c3ce0 EFLAGS: 00010202
[ 39.148390] RAX: 0017ffffd0020061 RBX: ffffea00064d4c00 RCX: 03ffffffffffffff
[ 39.148737] RDX: ffffea00064d4c00 RSI: 0000000000000000 RDI: ffffea00064d4c00
[ 39.149102] RBP: 0000000000000001 R08: ffffea00064d4c00 R09: 0000000000000078
[ 39.149482] R10: 00000000000000f0 R11: 0000000000000004 R12: 0000000000001000
[ 39.149832] R13: ffff888102df5c00 R14: ffff888102df5c00 R15: 0000000000000003
[ 39.150177] FS:  00007f51a56c9540(0000) GS:ffff888fffc00000(0000) knlGS:0000000000000000
[ 39.150623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.150950] CR2: 000055627b13fda0 CR3: 00000001083e2000 CR4: 00000000003506b0
[ 39.151317] Call Trace:
[ 39.151565]
[ 39.151778] ? __warn+0x84/0x130
[ 39.152044] ? swap_read_folio+0x4db/0x520
[ 39.152345] ? report_bug+0xfc/0x1e0
[ 39.152614] ? handle_bug+0x3f/0x70
[ 39.152891] ? exc_invalid_op+0x17/0x70
[ 39.153178] ? asm_exc_invalid_op+0x1a/0x20
[ 39.153467] ? swap_read_folio+0x4db/0x520
[ 39.153753] do_swap_page+0xc6d/0x14f0
[ 39.154054] ? srso_return_thunk+0x5/0x5f
[ 39.154361] __handle_mm_fault+0x758/0x850
[ 39.154645] handle_mm_fault+0x134/0x340
[ 39.154945] do_user_addr_fault+0x2e5/0x760
[ 39.155245] exc_page_fault+0x6a/0x140
[ 39.155546] asm_exc_page_fault+0x26/0x30
[ 39.155847] RIP: 0033:0x55627b071446
[ 39.156124] Code: f6 7e 19 83 e3 01 74 14 41 83 ee 01 44 89 35 25 72 0c 00 45 85 ed 0f 88 73 02 00 00 8b 05 ea 74 0c 00 85 c0 0f 85 da 03 00 00 <44> 8b 15 53 e9 0c 00 45 85 d2 74 2e 44 8b 0d 37 e3 0c 00 45 85 c9
[ 39.156944] RSP: 002b:00007ffd619d54f0 EFLAGS: 00010246
[ 39.157237] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f51a44f968b
[ 39.157594] RDX: 0000000000000000 RSI: 00007ffd619d5518 RDI: 00000000ffffffff
[ 39.157954] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
[ 39.158288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 39.158634] R13: 0000000000002b9a R14: 0000000000000000 R15: 00007ffd619d5518
[ 39.158998]
[ 39.159226] ---[ end trace 0000000000000000 ]---

After reverting either this patch or Usama's "mm: store zero pages to be
swapped out in a bitmap", the problem is gone, so I think these two patches
have some conflict that needs to be resolved.
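
In case it helps narrow this down, here is a standalone sketch of the
interaction I currently suspect (names and structure are my own, not the
actual kernel code): the zero-page series tracks, per swap slot, whether
that slot holds an all-zero page, while this series makes one
swap_read_folio() cover nr_pages consecutive slots. If those slots
disagree, no single whole-folio "synthesize zeros vs. read the device"
decision can be correct, which would match a WARN followed by user space
seeing wrong data and segfaulting.

/*
 * Illustration only -- plain user-space C, all names are hypothetical
 * and unrelated to the real mm/ code.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS 64

static bool zeromap[NR_SLOTS];	/* one "slot is an all-zero page" bit per swap slot */

/* A whole-folio decision is only valid when all covered slots agree. */
static bool folio_slots_consistent(unsigned long first, unsigned long nr)
{
	unsigned long i, zeros = 0;

	for (i = 0; i < nr; i++)
		zeros += zeromap[first + i];

	return zeros == 0 || zeros == nr;
}

int main(void)
{
	/* Slot 17 was swapped out as a zero page, its neighbours were not. */
	zeromap[17] = true;

	/* Order-0 swap-in decides each slot on its own: always consistent. */
	printf("order-0, slot 17:     %d\n", folio_slots_consistent(17, 1));

	/* A 64KiB (16-slot) swap-in covering slots 16..31 hits the mixed case. */
	printf("order-4, slots 16-31: %d\n", folio_slots_consistent(16, 16));

	return 0;
}

Run as-is, the second printf reports 0 (inconsistent), which is the kind of
mixed-slot situation I suspect swap_read_folio() now runs into. This is only
a guess from the backtrace, so please take it as a starting point rather
than a diagnosis.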