From: Yang Shi
Date: Thu, 29 Feb 2024 15:31:36 -0800
Subject: Re: [Chapter One] THP zones: the use cases of policy zones
To: Yu Zhao
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Jonathan Corbet
In-Reply-To: <20240229183436.4110845-2-yuzhao@google.com>
References: <20240229183436.4110845-1-yuzhao@google.com> <20240229183436.4110845-2-yuzhao@google.com>

On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
>    memory.
> 2. The device zone provides interoperability between CPU and device
>    memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
>    ZONE_MOVABLE) and restricted to a minimum order to be
>    anti-fragmentation. The latter means that they cannot be split down
>    below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
>    to an exact order. The latter means that not only is split
>    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
>    reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
>    higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
>    unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
>    contiguous PTEs on arm64 [1], which are more suitable for client
>    workloads.

I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios whose order is >= the min order of the
policy zones, the fallback order should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right? And an allocation whose order is <
the min order won't fall back to the policy zones even when all other
zones are depleted, and will fail instead, just like a non-movable
allocation can't fall back to ZONE_MOVABLE even though that zone still
has enough free memory, right?
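To make the question concrete, below is a minimal userspace sketch of
the order-based clamping done by the quoted gfp_order_zone() hunk,
assuming the reading above is correct. The zone enum, the helper name
highest_suitable_zone() and the boot-time orders (4 and 9) are
illustrative only; in the kernel the actual fallback is performed by
the zonelist walk in get_page_from_freelist(), with zone_is_suitable()
filtering each candidate zone.

/*
 * Minimal userspace model of the order-based clamping in the quoted
 * gfp_order_zone() hunk; zone indices and orders are illustrative.
 */
#include <stdio.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_NOSPLIT, ZONE_NOMERGE, NR_ZONES };

static const char *zone_names[NR_ZONES] = { "Normal", "Movable", "NoSplit", "NoMerge" };

static int zone_nosplit_order = 4;	/* hypothetical nosplit=...,4 */
static int zone_nomerge_order = 9;	/* hypothetical nomerge=...,9 */

/* Highest zone a (__GFP_MOVABLE | __GFP_COMP) allocation of @order may start from. */
static enum zone_type highest_suitable_zone(int order)
{
	int zid = ZONE_NOMERGE;	/* LAST_VIRT_ZONE for THP allocations */

	if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
		zid = ZONE_NOMERGE - 1;
	if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
		zid = ZONE_NOSPLIT - 1;

	return zid;
}

int main(void)
{
	int orders[] = { 0, 2, 4, 9 };

	for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		enum zone_type zid = highest_suitable_zone(orders[i]);

		/* The zonelist walk then falls back from zid toward ZONE_NORMAL. */
		printf("order %2d: fallback starts at %s\n", orders[i], zone_names[zid]);
	}

	return 0;
}

With these illustrative orders it prints Movable for orders 0 and 2,
NoSplit for order 4 and NoMerge for order 9, which matches the reading
that sub-minimum orders never see the policy zones.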
> > Policy zones can be dynamically resized by offlining pages in one of > them and onlining those pages in another of them. Note that this is > only done among policy zones, not between a policy zone and a physical > zone, since resizing is a (software) policy, not a physical > characteristic. > > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com= / > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao > --- > .../admin-guide/kernel-parameters.txt | 10 + > drivers/virtio/virtio_mem.c | 2 +- > include/linux/gfp.h | 24 +- > include/linux/huge_mm.h | 6 - > include/linux/mempolicy.h | 2 +- > include/linux/mmzone.h | 52 +- > include/linux/nodemask.h | 2 +- > include/linux/vm_event_item.h | 2 +- > include/trace/events/mmflags.h | 4 +- > mm/compaction.c | 12 + > mm/huge_memory.c | 5 +- > mm/mempolicy.c | 14 +- > mm/migrate.c | 7 +- > mm/mm_init.c | 452 ++++++++++-------- > mm/page_alloc.c | 44 +- > mm/page_isolation.c | 2 +- > mm/swap_slots.c | 3 +- > mm/vmscan.c | 32 +- > mm/vmstat.c | 7 +- > 19 files changed, 431 insertions(+), 251 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentat= ion/admin-guide/kernel-parameters.txt > index 31b3a25680d0..a6c181f6efde 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3529,6 +3529,16 @@ > allocations which rules out almost all kernel > allocations. Use with caution! > > + nosplit=3DX,Y [MM] Set the minimum order of the nosplit zone.= Pages in > + this zone can't be split down below order Y, whil= e free > + or in use. > + Like movablecore, X should be either nn[KMGTPE] o= r n%. > + > + nomerge=3DX,Y [MM] Set the exact orders of the nomerge zone. = Pages in > + this zone are always order Y, meaning they can't = be > + split or merged while free or in use. > + Like movablecore, X should be either nn[KMGTPE] o= r n%. > + > MTD_Partition=3D [MTD] > Format: ,,, > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 8e3223294442..37ecf5ee4afd 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct vir= tio_mem *vm, > page =3D pfn_to_online_page(pfn); > if (!page) > continue; > - if (page_zonenum(page) !=3D ZONE_MOVABLE) > + if (!is_zone_movable_page(page)) > return false; > } > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t = gfp_flags) > * GFP_ZONES_SHIFT must be <=3D 2 on 32 bit platforms. 
> */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <=3D 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <=3D 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z =3D (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) =3D=3D (__GFP_MOVABLE = | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + > return z; > } > > +extern int zone_nomerge_order __read_mostly; > +extern int zone_nosplit_order __read_mostly; > + > +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) > +{ > + enum zone_type zid =3D gfp_zone(flags); > + > + if (zid >=3D ZONE_NOMERGE && order !=3D zone_nomerge_order) > + zid =3D ZONE_NOMERGE - 1; > + > + if (zid >=3D ZONE_NOSPLIT && order < zone_nosplit_order) > + zid =3D ZONE_NOSPLIT - 1; > + > + return zid; > +} > + > /* > * There is only one page-allocator function, and two main namespaces to > * it. The alloc_page*() variants return 'struct page *' and as such > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..9960ad7c3b10 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp= , unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long fla= gs); > > void folio_prep_large_rmappable(struct folio *folio); > -bool can_split_folio(struct folio *folio, int *pextra_pins); > int split_huge_page_to_list(struct page *page, struct list_head *list); > static inline int split_huge_page(struct page *page) > { > @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct= folio *folio) {} > > #define thp_get_unmapped_area NULL > > -static inline bool > -can_split_folio(struct folio *folio, int *pextra_pins) > -{ > - return false; > -} > static inline int > split_huge_page_to_list(struct page *page, struct list_head *list) > { > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..a92bcf47cf8c 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; > > static inline void check_highest_zone(enum zone_type k) > { > - if (k > policy_zone && k !=3D ZONE_MOVABLE) > + if (k > policy_zone && !zid_is_virt(k)) > policy_zone =3D k; > } > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index a497f189d988..532218167bba 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -805,11 +805,15 @@ enum zone_type { > * there can be false negatives). 
> */ > ZONE_MOVABLE, > + ZONE_NOSPLIT, > + ZONE_NOMERGE, > #ifdef CONFIG_ZONE_DEVICE > ZONE_DEVICE, > #endif > - __MAX_NR_ZONES > + __MAX_NR_ZONES, > > + LAST_PHYS_ZONE =3D ZONE_MOVABLE - 1, > + LAST_VIRT_ZONE =3D ZONE_NOMERGE, > }; > > #ifndef __GENERATING_BOUNDS_H > @@ -929,6 +933,8 @@ struct zone { > seqlock_t span_seqlock; > #endif > > + int order; > + > int initialized; > > /* Write-intensive fields used from the page allocator */ > @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const str= uct folio *folio) > > static inline bool is_zone_movable_page(const struct page *page) > { > - return page_zonenum(page) =3D=3D ZONE_MOVABLE; > + return page_zonenum(page) >=3D ZONE_MOVABLE; > } > > static inline bool folio_is_zone_movable(const struct folio *folio) > { > - return folio_zonenum(folio) =3D=3D ZONE_MOVABLE; > + return folio_zonenum(folio) >=3D ZONE_MOVABLE; > +} > + > +static inline bool page_can_split(struct page *page) > +{ > + return page_zonenum(page) < ZONE_NOSPLIT; > +} > + > +static inline bool folio_can_split(struct folio *folio) > +{ > + return folio_zonenum(folio) < ZONE_NOSPLIT; > } > #endif > > @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) {= return node_id; }; > */ > #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > > +static inline bool zid_is_virt(enum zone_type zid) > +{ > + return zid > LAST_PHYS_ZONE && zid <=3D LAST_VIRT_ZONE; > +} > + > +static inline bool zone_can_frag(struct zone *zone) > +{ > + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); > + > + return zone_idx(zone) < ZONE_NOSPLIT; > +} > + > +static inline bool zone_is_suitable(struct zone *zone, int order) > +{ > + int zid =3D zone_idx(zone); > + > + if (zid < ZONE_NOSPLIT) > + return true; > + > + if (!zone->order) > + return false; > + > + return (zid =3D=3D ZONE_NOSPLIT && order >=3D zone->order) || > + (zid =3D=3D ZONE_NOMERGE && order =3D=3D zone->order); > +} > + > #ifdef CONFIG_ZONE_DEVICE > static inline bool zone_is_zone_device(struct zone *zone) > { > @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) > static inline void zone_set_nid(struct zone *zone, int nid) {} > #endif > > -extern int movable_zone; > +extern int virt_zone; > > static inline int is_highmem_idx(enum zone_type idx) > { > #ifdef CONFIG_HIGHMEM > return (idx =3D=3D ZONE_HIGHMEM || > - (idx =3D=3D ZONE_MOVABLE && movable_zone =3D=3D ZONE_HIGH= MEM)); > + (zid_is_virt(idx) && virt_zone =3D=3D ZONE_HIGHMEM)); > #else > return 0; > #endif > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index b61438313a73..34fbe910576d 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -404,7 +404,7 @@ enum node_states { > #else > N_HIGH_MEMORY =3D N_NORMAL_MEMORY, > #endif > - N_MEMORY, /* The node has memory(regular, high, mov= able) */ > + N_MEMORY, /* The node has memory in any of the zone= s */ > N_CPU, /* The node has one or more cpus */ > N_GENERIC_INITIATOR, /* The node has one or more Generic Initi= ators */ > NR_NODE_STATES > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.= h > index 747943bc8cc2..9a54d15d5ec3 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -27,7 +27,7 @@ > #endif > > #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ > - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) > + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE= _ZONE(xx) > > enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, 
PSWPOUT, > FOR_ALL_ZONES(PGALLOC) > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflag= s.h > index d801409b33cf..2b5fdafaadea 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" )= \ > IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ > EM (ZONE_NORMAL, "Normal") \ > IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ > - EMe(ZONE_MOVABLE,"Movable") > + EM (ZONE_MOVABLE,"Movable") \ > + EM (ZONE_NOSPLIT,"NoSplit") \ > + EMe(ZONE_NOMERGE,"NoMerge") > > #define LRU_NAMES \ > EM (LRU_INACTIVE_ANON, "inactive_anon") \ > diff --git a/mm/compaction.c b/mm/compaction.c > index 4add68d40e8d..8a64c805f411 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_= mask, unsigned int order, > ac->highest_zoneidx, ac->nodemask= ) { > enum compact_result status; > > + if (!zone_can_frag(zone)) > + continue; > + > if (prio > MIN_COMPACT_PRIORITY > && compaction_deferred(zone, orde= r)) { > rc =3D max_t(enum compact_result, COMPACT_DEFERRE= D, rc); > @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat= ) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone =3D zone; > > compact_zone(&cc, NULL); > @@ -2846,6 +2852,9 @@ static void compact_node(int nid) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone =3D zone; > > compact_zone(&cc, NULL); > @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgda= t) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > ret =3D compaction_suit_allocation_order(zone, > pgdat->kcompactd_max_order, > highest_zoneidx, ALLOC_WMARK_MIN); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 94c958f7ebb5..b57faa0a1e83 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, = struct list_head *list, > } > > /* Racy check whether the huge page can be split */ > -bool can_split_folio(struct folio *folio, int *pextra_pins) > +static bool can_split_folio(struct folio *folio, int *pextra_pins) > { > int extra_pins; > > + if (!folio_can_split(folio)) > + return false; > + > /* Additional pins from page cache */ > if (folio_test_anon(folio)) > extra_pins =3D folio_test_swapcache(folio) ? > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 10a590ee1c89..1f84dd759086 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) > > bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > - enum zone_type dynamic_policy_zone =3D policy_zone; > - > - BUG_ON(dynamic_policy_zone =3D=3D ZONE_MOVABLE); > + WARN_ON_ONCE(zid_is_virt(policy_zone)); > > /* > - * if policy->nodes has movable memory only, > - * we apply policy when gfp_zone(gfp) =3D ZONE_MOVABLE only. > + * If policy->nodes has memory in virtual zones only, we apply po= licy > + * only if gfp_zone(gfp) can allocate from those zones. > * > * policy->nodes is intersect with node_states[N_MEMORY]. > * so if the following test fails, it implies > - * policy->nodes has movable memory only. > + * policy->nodes has memory in virtual zones only. 
> */ > if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) > - dynamic_policy_zone =3D ZONE_MOVABLE; > + return zone > LAST_PHYS_ZONE; > > - return zone >=3D dynamic_policy_zone; > + return zone >=3D policy_zone; > } > > /* Do dynamic interleaving for a process */ > diff --git a/mm/migrate.c b/mm/migrate.c > index cc9f2bcd73b4..f615c0c22046 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *fol= io, struct list_head *split_f > { > int rc; > > + if (!folio_can_split(folio)) > + return -EBUSY; > + > folio_lock(folio); > rc =3D split_folio_to_list(folio, split_folios); > folio_unlock(folio); > @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *= src, unsigned long private) > order =3D folio_order(src); > } > zidx =3D zone_idx(folio_zone(src)); > - if (is_highmem_idx(zidx) || zidx =3D=3D ZONE_MOVABLE) > + if (zidx > ZONE_NORMAL) > gfp_mask |=3D __GFP_HIGHMEM; > > return __folio_alloc(gfp_mask, order, nid, mtc->nmask); > @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgd= at, struct folio *folio) > break; > } > wakeup_kswapd(pgdat->node_zones + z, 0, > - folio_order(folio), ZONE_MOVABLE); > + folio_order(folio), z); > return 0; > } > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 2c19f5515e36..7769c21e6d54 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); > > static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initd= ata; > static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __init= data; > -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; > > -static unsigned long required_kernelcore __initdata; > -static unsigned long required_kernelcore_percent __initdata; > -static unsigned long required_movablecore __initdata; > -static unsigned long required_movablecore_percent __initdata; > +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUM= NODES] __initdata; > +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) > + > +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] = __initdata; > +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) > + > +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1= ] __initdata; > +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) > + > +int zone_nosplit_order __read_mostly; > +int zone_nomerge_order __read_mostly; > > static unsigned long nr_kernel_pages __initdata; > static unsigned long nr_all_pages __initdata; > @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) > return 0; > } > > - return cmdline_parse_core(p, &required_kernelcore, > - &required_kernelcore_percent); > + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), > + &percentage_of(LAST_PHYS_ZONE)); > } > early_param("kernelcore", cmdline_parse_kernelcore); > > @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore)= ; > */ > static int __init cmdline_parse_movablecore(char *p) > { > - return cmdline_parse_core(p, &required_movablecore, > - &required_movablecore_percent); > + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), > + &percentage_of(ZONE_MOVABLE)); > } > early_param("movablecore", cmdline_parse_movablecore); > > +static int __init parse_zone_order(char *p, unsigned long *nr_pages, > + unsigned long *percent, int *order) > +{ > + int err; > + unsigned long n; > + char *s =3D strchr(p, ','); > + > + if 
(!s) > + return -EINVAL; > + > + *s++ =3D '\0'; > + > + err =3D kstrtoul(s, 0, &n); > + if (err) > + return err; > + > + if (n < 2 || n > MAX_PAGE_ORDER) > + return -EINVAL; > + > + err =3D cmdline_parse_core(p, nr_pages, percent); > + if (err) > + return err; > + > + *order =3D n; > + > + return 0; > +} > + > +static int __init parse_zone_nosplit(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), > + &percentage_of(ZONE_NOSPLIT), &zone_nospl= it_order); > +} > +early_param("nosplit", parse_zone_nosplit); > + > +static int __init parse_zone_nomerge(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), > + &percentage_of(ZONE_NOMERGE), &zone_nomer= ge_order); > +} > +early_param("nomerge", parse_zone_nomerge); > + > /* > * early_calculate_totalpages() > - * Sum pages in active regions for movable zone. > + * Sum pages in active regions for virtual zones. > * Populate N_MEMORY for calculating usable_nodes. > */ > static unsigned long __init early_calculate_totalpages(void) > @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalp= ages(void) > } > > /* > - * This finds a zone that can be used for ZONE_MOVABLE pages. The > + * This finds a physical zone that can be used for virtual zones. The > * assumption is made that zones within a node are ordered in monotonic > * increasing memory addresses so that the "highest" populated zone is u= sed > */ > -static void __init find_usable_zone_for_movable(void) > +static void __init find_usable_zone(void) > { > int zone_index; > - for (zone_index =3D MAX_NR_ZONES - 1; zone_index >=3D 0; zone_ind= ex--) { > - if (zone_index =3D=3D ZONE_MOVABLE) > - continue; > - > + for (zone_index =3D LAST_PHYS_ZONE; zone_index >=3D 0; zone_index= --) { > if (arch_zone_highest_possible_pfn[zone_index] > > arch_zone_lowest_possible_pfn[zone_index]= ) > break; > } > > VM_BUG_ON(zone_index =3D=3D -1); > - movable_zone =3D zone_index; > + virt_zone =3D zone_index; > +} > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long = *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; > + int usable_nodes =3D nodes_weight(node_states[N_MEMORY]); > + /* usable_startpfn is the lowest possible pfn virtual zones can b= e at */ > + unsigned long usable_startpfn =3D arch_zone_lowest_possible_pfn[v= irt_zone]; > + > +restart: > + /* Carve out memory as evenly as possible throughout nodes */ > + node_avg =3D occupied / usable_nodes; > + for_each_node_state(nid, N_MEMORY) { > + unsigned long start_pfn, end_pfn; > + > + /* > + * Recalculate node_avg if the division per node now exce= eds > + * what is necessary to satisfy the amount of memory to c= arve > + * out. > + */ > + if (occupied < node_avg) > + node_avg =3D occupied / usable_nodes; > + > + /* > + * As the map is walked, we track how much memory is usab= le > + * using remaining. When it is 0, the rest of the node is > + * usable. 
> + */ > + remaining =3D node_avg; > + > + /* Go through each range of PFNs within this node */ > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL= ) { > + unsigned long size_pages; > + > + start_pfn =3D max(start_pfn, zone_pfn[nid]); > + if (start_pfn >=3D end_pfn) > + continue; > + > + /* Account for what is only usable when carving o= ut */ > + if (start_pfn < usable_startpfn) { > + unsigned long nr_pages =3D min(end_pfn, u= sable_startpfn) - start_pfn; > + > + remaining -=3D min(nr_pages, remaining); > + occupied -=3D min(nr_pages, occupied); > + > + /* Continue if range is now fully account= ed */ > + if (end_pfn <=3D usable_startpfn) { > + > + /* > + * Push zone_pfn to the end so th= at if > + * we have to carve out more acro= ss > + * nodes, we will not double acco= unt > + * here. > + */ > + zone_pfn[nid] =3D end_pfn; > + continue; > + } > + start_pfn =3D usable_startpfn; > + } > + > + /* > + * The usable PFN range is from start_pfn->end_pf= n. > + * Calculate size_pages as the number of pages us= ed. > + */ > + size_pages =3D end_pfn - start_pfn; > + if (size_pages > remaining) > + size_pages =3D remaining; > + zone_pfn[nid] =3D start_pfn + size_pages; > + > + /* > + * Some memory was carved out, update counts and = break > + * if the request for this node has been satisfie= d. > + */ > + occupied -=3D min(occupied, size_pages); > + remaining -=3D size_pages; > + if (!remaining) > + break; > + } > + } > + > + /* > + * If there is still more to carve out, we do another pass with o= ne less > + * node in the count. This will push zone_pfn[nid] further along = on the > + * nodes that still have memory until the request is fully satisf= ied. > + */ > + usable_nodes--; > + if (usable_nodes && occupied > usable_nodes) > + goto restart; > } > > /* > @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(voi= d) > * memory. When they don't, some nodes will have more kernelcore than > * others > */ > -static void __init find_zone_movable_pfns_for_nodes(void) > +static void __init find_virt_zones(void) > { > - int i, nid; > + int i; > + int nid; > unsigned long usable_startpfn; > - unsigned long kernelcore_node, kernelcore_remaining; > /* save the state before borrow the nodemask */ > nodemask_t saved_node_state =3D node_states[N_MEMORY]; > unsigned long totalpages =3D early_calculate_totalpages(); > - int usable_nodes =3D nodes_weight(node_states[N_MEMORY]); > struct memblock_region *r; > + unsigned long occupied =3D 0; > > - /* Need to find movable_zone earlier when movable_node is specifi= ed. */ > - find_usable_zone_for_movable(); > + /* Need to find virt_zone earlier when movable_node is specified.= */ > + find_usable_zone(); > > /* > * If movable_node is specified, ignore kernelcore and movablecor= e > @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(v= oid) > nid =3D memblock_get_region_node(r); > > usable_startpfn =3D PFN_DOWN(r->base); > - zone_movable_pfn[nid] =3D zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid= ]) : > + pfn_of(ZONE_MOVABLE, nid) =3D pfn_of(ZONE_MOVABLE= , nid) ? > + min(usable_startpfn, pfn_of(ZONE_MOVABLE,= nid)) : > usable_startpfn; > } > > @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(v= oid) > continue; > } > > - zone_movable_pfn[nid] =3D zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid= ]) : > + pfn_of(ZONE_MOVABLE, nid) =3D pfn_of(ZONE_MOVABLE= , nid) ? 
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE,= nid)) : > usable_startpfn; > } > > @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_node= s(void) > goto out2; > } > > + if (zone_nomerge_order && zone_nomerge_order <=3D zone_nosplit_or= der) { > + nr_pages_of(ZONE_NOSPLIT) =3D nr_pages_of(ZONE_NOMERGE) = =3D 0; > + percentage_of(ZONE_NOSPLIT) =3D percentage_of(ZONE_NOMERG= E) =3D 0; > + zone_nosplit_order =3D zone_nomerge_order =3D 0; > + pr_warn("zone %s order %d must be higher zone %s order %d= \n", > + zone_names[ZONE_NOMERGE], zone_nomerge_order, > + zone_names[ZONE_NOSPLIT], zone_nosplit_order); > + } > + > /* > * If kernelcore=3Dnn% or movablecore=3Dnn% was specified, calcul= ate the > * amount of necessary memory. > */ > - if (required_kernelcore_percent) > - required_kernelcore =3D (totalpages * 100 * required_kern= elcore_percent) / > - 10000UL; > - if (required_movablecore_percent) > - required_movablecore =3D (totalpages * 100 * required_mov= ablecore_percent) / > - 10000UL; > + for (i =3D LAST_PHYS_ZONE; i <=3D LAST_VIRT_ZONE; i++) { > + if (percentage_of(i)) > + nr_pages_of(i) =3D totalpages * percentage_of(i) = / 100; > + > + nr_pages_of(i) =3D roundup(nr_pages_of(i), MAX_ORDER_NR_P= AGES); > + occupied +=3D nr_pages_of(i); > + } > > /* > * If movablecore=3D was specified, calculate what size of > * kernelcore that corresponds so that memory usable for > * any allocation type is evenly spread. If both kernelcore > * and movablecore are specified, then the value of kernelcore > - * will be used for required_kernelcore if it's greater than > - * what movablecore would have allowed. > + * will be used if it's greater than what movablecore would have > + * allowed. > */ > - if (required_movablecore) { > - unsigned long corepages; > + if (occupied < totalpages) { > + enum zone_type zid; > > - /* > - * Round-up so that ZONE_MOVABLE is at least as large as = what > - * was requested by the user > - */ > - required_movablecore =3D > - roundup(required_movablecore, MAX_ORDER_NR_PAGES)= ; > - required_movablecore =3D min(totalpages, required_movable= core); > - corepages =3D totalpages - required_movablecore; > - > - required_kernelcore =3D max(required_kernelcore, corepage= s); > + zid =3D !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_= MOVABLE) ? > + LAST_PHYS_ZONE : ZONE_MOVABLE; > + nr_pages_of(zid) +=3D totalpages - occupied; > } > > /* > * If kernelcore was not specified or kernelcore size is larger > - * than totalpages, there is no ZONE_MOVABLE. > + * than totalpages, there are not virtual zones. 
> */ > - if (!required_kernelcore || required_kernelcore >=3D totalpages) > + occupied =3D nr_pages_of(LAST_PHYS_ZONE); > + if (!occupied || occupied >=3D totalpages) > goto out; > > - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be= at */ > - usable_startpfn =3D arch_zone_lowest_possible_pfn[movable_zone]; > + for (i =3D LAST_PHYS_ZONE + 1; i <=3D LAST_VIRT_ZONE; i++) { > + if (!nr_pages_of(i)) > + continue; > > -restart: > - /* Spread kernelcore memory as evenly as possible throughout node= s */ > - kernelcore_node =3D required_kernelcore / usable_nodes; > - for_each_node_state(nid, N_MEMORY) { > - unsigned long start_pfn, end_pfn; > - > - /* > - * Recalculate kernelcore_node if the division per node > - * now exceeds what is necessary to satisfy the requested > - * amount of memory for the kernel > - */ > - if (required_kernelcore < kernelcore_node) > - kernelcore_node =3D required_kernelcore / usable_= nodes; > - > - /* > - * As the map is walked, we track how much memory is usab= le > - * by the kernel using kernelcore_remaining. When it is > - * 0, the rest of the node is usable by ZONE_MOVABLE > - */ > - kernelcore_remaining =3D kernelcore_node; > - > - /* Go through each range of PFNs within this node */ > - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL= ) { > - unsigned long size_pages; > - > - start_pfn =3D max(start_pfn, zone_movable_pfn[nid= ]); > - if (start_pfn >=3D end_pfn) > - continue; > - > - /* Account for what is only usable for kernelcore= */ > - if (start_pfn < usable_startpfn) { > - unsigned long kernel_pages; > - kernel_pages =3D min(end_pfn, usable_star= tpfn) > - - start_p= fn; > - > - kernelcore_remaining -=3D min(kernel_page= s, > - kernelcore_remain= ing); > - required_kernelcore -=3D min(kernel_pages= , > - required_kernelco= re); > - > - /* Continue if range is now fully account= ed */ > - if (end_pfn <=3D usable_startpfn) { > - > - /* > - * Push zone_movable_pfn to the e= nd so > - * that if we have to rebalance > - * kernelcore across nodes, we wi= ll > - * not double account here > - */ > - zone_movable_pfn[nid] =3D end_pfn= ; > - continue; > - } > - start_pfn =3D usable_startpfn; > - } > - > - /* > - * The usable PFN range for ZONE_MOVABLE is from > - * start_pfn->end_pfn. Calculate size_pages as th= e > - * number of pages used as kernelcore > - */ > - size_pages =3D end_pfn - start_pfn; > - if (size_pages > kernelcore_remaining) > - size_pages =3D kernelcore_remaining; > - zone_movable_pfn[nid] =3D start_pfn + size_pages; > - > - /* > - * Some kernelcore has been met, update counts an= d > - * break if the kernelcore for this node has been > - * satisfied > - */ > - required_kernelcore -=3D min(required_kernelcore, > - size_page= s); > - kernelcore_remaining -=3D size_pages; > - if (!kernelcore_remaining) > - break; > - } > + find_virt_zone(occupied, &pfn_of(i, 0)); > + occupied +=3D nr_pages_of(i); > } > - > - /* > - * If there is still required_kernelcore, we do another pass with= one > - * less node in the count. 
This will push zone_movable_pfn[nid] f= urther > - * along on the nodes that still have memory until kernelcore is > - * satisfied > - */ > - usable_nodes--; > - if (usable_nodes && required_kernelcore > usable_nodes) > - goto restart; > - > out2: > - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES = */ > + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGE= S */ > for (nid =3D 0; nid < MAX_NUMNODES; nid++) { > unsigned long start_pfn, end_pfn; > - > - zone_movable_pfn[nid] =3D > - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES= ); > + unsigned long prev_virt_zone_pfn =3D 0; > > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > - if (zone_movable_pfn[nid] >=3D end_pfn) > - zone_movable_pfn[nid] =3D 0; > + > + for (i =3D LAST_PHYS_ZONE + 1; i <=3D LAST_VIRT_ZONE; i++= ) { > + pfn_of(i, nid) =3D roundup(pfn_of(i, nid), MAX_OR= DER_NR_PAGES); > + > + if (pfn_of(i, nid) <=3D prev_virt_zone_pfn || pfn= _of(i, nid) >=3D end_pfn) > + pfn_of(i, nid) =3D 0; > + > + if (pfn_of(i, nid)) > + prev_virt_zone_pfn =3D pfn_of(i, nid); > + } > } > - > out: > /* restore the node_state */ > node_states[N_MEMORY] =3D saved_node_state; > @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *z= one, > #endif > > /* > - * The zone ranges provided by the architecture do not include ZONE_MOVA= BLE > - * because it is sized independent of architecture. Unlike the other zon= es, > - * the starting point for ZONE_MOVABLE is not fixed. It may be different > - * in each node depending on the size of each node and how evenly kernel= core > - * is distributed. This helper function adjusts the zone ranges > + * The zone ranges provided by the architecture do not include virtual z= ones > + * because they are sized independent of architecture. Unlike physical z= ones, > + * the starting point for the first populated virtual zone is not fixed.= It may > + * be different in each node depending on the size of each node and how = evenly > + * kernelcore is distributed. This helper function adjusts the zone rang= es > * provided by the architecture for a given node by using the end of the > - * highest usable zone for ZONE_MOVABLE. This preserves the assumption t= hat > - * zones within a node are in order of monotonic increases memory addres= ses > + * highest usable zone for the first populated virtual zone. This preser= ves the > + * assumption that zones within a node are in order of monotonic increas= es > + * memory addresses. 
> */ > -static void __init adjust_zone_range_for_zone_movable(int nid, > +static void __init adjust_zone_range(int nid, > unsigned long zone_type, > unsigned long node_end_pfn, > unsigned long *zone_start_pfn, > unsigned long *zone_end_pfn) > { > - /* Only adjust if ZONE_MOVABLE is on this node */ > - if (zone_movable_pfn[nid]) { > - /* Size ZONE_MOVABLE */ > - if (zone_type =3D=3D ZONE_MOVABLE) { > - *zone_start_pfn =3D zone_movable_pfn[nid]; > - *zone_end_pfn =3D min(node_end_pfn, > - arch_zone_highest_possible_pfn[movable_zo= ne]); > + int i =3D max_t(int, zone_type, LAST_PHYS_ZONE); > + unsigned long next_virt_zone_pfn =3D 0; > > - /* Adjust for ZONE_MOVABLE starting within this range */ > - } else if (!mirrored_kernelcore && > - *zone_start_pfn < zone_movable_pfn[nid] && > - *zone_end_pfn > zone_movable_pfn[nid]) { > - *zone_end_pfn =3D zone_movable_pfn[nid]; > + while (i++ < LAST_VIRT_ZONE) { > + if (pfn_of(i, nid)) { > + next_virt_zone_pfn =3D pfn_of(i, nid); > + break; > + } > + } > > - /* Check if this whole range is within ZONE_MOVABLE */ > - } else if (*zone_start_pfn >=3D zone_movable_pfn[nid]) > + if (zone_type <=3D LAST_PHYS_ZONE) { > + if (!next_virt_zone_pfn) > + return; > + > + if (!mirrored_kernelcore && > + *zone_start_pfn < next_virt_zone_pfn && > + *zone_end_pfn > next_virt_zone_pfn) > + *zone_end_pfn =3D next_virt_zone_pfn; > + else if (*zone_start_pfn >=3D next_virt_zone_pfn) > *zone_start_pfn =3D *zone_end_pfn; > + } else if (zone_type <=3D LAST_VIRT_ZONE) { > + if (!pfn_of(zone_type, nid)) > + return; > + > + if (next_virt_zone_pfn) > + *zone_end_pfn =3D min3(next_virt_zone_pfn, > + node_end_pfn, > + arch_zone_highest_possible_p= fn[virt_zone]); > + else > + *zone_end_pfn =3D min(node_end_pfn, > + arch_zone_highest_possible_pf= n[virt_zone]); > + *zone_start_pfn =3D min(*zone_end_pfn, pfn_of(zone_type, = nid)); > } > } > > @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_no= de(int nid, > * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages > * and vice versa. 
> */ > - if (mirrored_kernelcore && zone_movable_pfn[nid]) { > + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { > unsigned long start_pfn, end_pfn; > struct memblock_region *r; > > @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_n= ode(int nid, > /* Get the start and end of the zone */ > *zone_start_pfn =3D clamp(node_start_pfn, zone_low, zone_high); > *zone_end_pfn =3D clamp(node_end_pfn, zone_low, zone_high); > - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, > - zone_start_pfn, zone_end_pfn); > + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, z= one_end_pfn); > > /* Check that this node has pages within the zone's required rang= e */ > if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_= pfn) > @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struc= t pglist_data *pgdat, > #if defined(CONFIG_MEMORY_HOTPLUG) > zone->present_early_pages =3D real_size; > #endif > + if (i =3D=3D ZONE_NOSPLIT) > + zone->order =3D zone_nosplit_order; > + if (i =3D=3D ZONE_NOMERGE) > + zone->order =3D zone_nomerge_order; > > totalpages +=3D spanned; > realtotalpages +=3D real_size; > @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgda= t) > { > enum zone_type zone_type; > > - for (zone_type =3D 0; zone_type <=3D ZONE_MOVABLE - 1; zone_type+= +) { > + for (zone_type =3D 0; zone_type <=3D LAST_PHYS_ZONE; zone_type++)= { > struct zone *zone =3D &pgdat->node_zones[zone_type]; > if (populated_zone(zone)) { > if (IS_ENABLED(CONFIG_HIGHMEM)) > @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) > void __init free_area_init(unsigned long *max_zone_pfn) > { > unsigned long start_pfn, end_pfn; > - int i, nid, zone; > + int i, j, nid, zone; > bool descending; > > /* Record where the zone boundaries are */ > @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > start_pfn =3D PHYS_PFN(memblock_start_of_DRAM()); > descending =3D arch_has_descending_max_zone_pfns(); > > - for (i =3D 0; i < MAX_NR_ZONES; i++) { > + for (i =3D 0; i <=3D LAST_PHYS_ZONE; i++) { > if (descending) > - zone =3D MAX_NR_ZONES - i - 1; > + zone =3D LAST_PHYS_ZONE - i; > else > zone =3D i; > > - if (zone =3D=3D ZONE_MOVABLE) > - continue; > - > end_pfn =3D max(max_zone_pfn[zone], start_pfn); > arch_zone_lowest_possible_pfn[zone] =3D start_pfn; > arch_zone_highest_possible_pfn[zone] =3D end_pfn; > @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > start_pfn =3D end_pfn; > } > > - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ > - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); > - find_zone_movable_pfns_for_nodes(); > + /* Find the PFNs that virtual zones begin at in each node */ > + find_virt_zones(); > > /* Print out the zone ranges */ > pr_info("Zone ranges:\n"); > - for (i =3D 0; i < MAX_NR_ZONES; i++) { > - if (i =3D=3D ZONE_MOVABLE) > - continue; > + for (i =3D 0; i <=3D LAST_PHYS_ZONE; i++) { > pr_info(" %-8s ", zone_names[i]); > if (arch_zone_lowest_possible_pfn[i] =3D=3D > arch_zone_highest_possible_pfn[i]) > @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > << PAGE_SHIFT) - 1); > } > > - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ > - pr_info("Movable zone start for each node\n"); > - for (i =3D 0; i < MAX_NUMNODES; i++) { > - if (zone_movable_pfn[i]) > - pr_info(" Node %d: %#018Lx\n", i, > - (u64)zone_movable_pfn[i] << PAGE_SHIFT); > + /* Print out the PFNs virtual 
zones begin at in each node */ > + for (; i <=3D LAST_VIRT_ZONE; i++) { > + pr_info("%s zone start for each node\n", zone_names[i]); > + for (j =3D 0; j < MAX_NUMNODES; j++) { > + if (pfn_of(i, j)) > + pr_info(" Node %d: %#018Lx\n", > + j, (u64)pfn_of(i, j) << PAGE_SHIF= T); > + } > } > > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 150d4f23b010..6a4da8f8691c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] =3D { > "HighMem", > #endif > "Movable", > + "NoSplit", > + "NoMerge", > #ifdef CONFIG_ZONE_DEVICE > "Device", > #endif > @@ -290,9 +292,9 @@ int user_min_free_kbytes =3D -1; > static int watermark_boost_factor __read_mostly =3D 15000; > static int watermark_scale_factor =3D 10; > > -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from = */ > -int movable_zone; > -EXPORT_SYMBOL(movable_zone); > +/* virt_zone is the "real" zone pages in virtual zones are taken from */ > +int virt_zone; > +EXPORT_SYMBOL(virt_zone); > > #if MAX_NUMNODES > 1 > unsigned int nr_node_ids __read_mostly =3D MAX_NUMNODES; > @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long b= uddy_pfn, > unsigned long higher_page_pfn; > struct page *higher_page; > > - if (order >=3D MAX_PAGE_ORDER - 1) > - return false; > - > higher_page_pfn =3D buddy_pfn & pfn; > higher_page =3D page + (higher_page_pfn - pfn); > > @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long = buddy_pfn, > NULL) !=3D NULL; > } > > +static int zone_max_order(struct zone *zone) > +{ > + return zone->order && zone_idx(zone) =3D=3D ZONE_NOMERGE ? zone->= order : MAX_PAGE_ORDER; > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + int max_order =3D zone_max_order(zone); > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > - while (order < MAX_PAGE_ORDER) { > + while (order < max_order) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > migratety= pe); > @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, > to_tail =3D true; > else if (is_shuffle_order(order)) > to_tail =3D shuffle_pick_tail(); > + else if (order + 1 >=3D max_order) > + to_tail =3D false; > else > to_tail =3D buddy_merge_likely(pfn, buddy_pfn, page, orde= r); > > @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, > int mt; > int ret =3D 0; > > + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); > + > if (split_pfn_offset =3D=3D 0) > return ret; > > @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, = unsigned int order, > struct free_area *area; > struct page *page; > > + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); > + > /* Find a page of the appropriate size in the preferred list */ > for (current_order =3D order; current_order < NR_PAGE_ORDERS; ++c= urrent_order) { > area =3D &(zone->free_area[current_order]); > @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned i= nt order, unsigned long mark, > long min =3D mark; > int o; > > + if (!zone_is_suitable(z, order)) > + return false; > + > /* free_pages may go 
negative - that's OK */ > free_pages -=3D __zone_watermark_unusable_free(z, order, alloc_fl= ags); > > @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone = *z, unsigned int order, > { > long free_pages; > > + if (!zone_is_suitable(z, order)) > + return false; > + > free_pages =3D zone_page_state(z, NR_FREE_PAGES); > > /* > @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int= order, int alloc_flags, > struct page *page; > unsigned long mark; > > + if (!zone_is_suitable(zone, order)) > + continue; > + > if (cpusets_enabled() && > (alloc_flags & ALLOC_CPUSET) && > !__cpuset_zone_allowed(zone, gfp_mask)) > @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void) > struct zone *zone; > unsigned long flags; > > - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE page= s */ > + /* Calculate total number of pages below ZONE_HIGHMEM */ > for_each_zone(zone) { > - if (!is_highmem(zone) && zone_idx(zone) !=3D ZONE_MOVABLE= ) > + if (zone_idx(zone) <=3D ZONE_NORMAL) > lowmem_pages +=3D zone_managed_pages(zone); > } > > @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void) > spin_lock_irqsave(&zone->lock, flags); > tmp =3D (u64)pages_min * zone_managed_pages(zone); > do_div(tmp, lowmem_pages); > - if (is_highmem(zone) || zone_idx(zone) =3D=3D ZONE_MOVABL= E) { > + if (zone_idx(zone) > ZONE_NORMAL) { > /* > * __GFP_HIGH and PF_MEMALLOC allocations usually= don't > - * need highmem and movable zones pages, so cap p= ages_min > - * to a small value here. > + * need pages from zones above ZONE_NORMAL, so ca= p > + * pages_min to a small value here. > * > * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_= MIN) > * deltas control async page reclaim, and so shou= ld > diff --git a/mm/page_isolation.c b/mm/page_isolation.c > index cd0ea3668253..8a6473543427 100644 > --- a/mm/page_isolation.c > +++ b/mm/page_isolation.c > @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long s= tart_pfn, unsigned long e > * pages then it should be reasonably safe to assume the = rest > * is movable. > */ > - if (zone_idx(zone) =3D=3D ZONE_MOVABLE) > + if (zid_is_virt(zone_idx(zone))) > continue; > > /* > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 0bec1f705f8e..ad0db0373b05 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > entry.val =3D 0; > > if (folio_test_large(folio)) { > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported= ()) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported= () && > + folio_test_pmd_mappable(folio)) > get_swap_pages(1, &entry, folio_nr_pages(folio)); > goto out; > } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4f9c854ce6cc..ae061ec4866a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list= _head *folio_list, > goto keep_locked; > if (folio_maybe_dma_pinned(folio)) > goto keep_locked; > - if (folio_test_large(folio)) { > - /* cannot split folio, skip it */ > - if (!can_split_folio(folio, NULL)= ) > - goto activate_locked; > - /* > - * Split folios without a PMD map= right > - * away. Chances are some or all = of the > - * tail pages can be freed withou= t IO. > - */ > - if (!folio_entire_mapcount(folio)= && > - split_folio_to_list(folio, > - folio_lis= t)) > - goto activate_locked; > - } > + /* > + * Split folios that are not fully map ri= ght > + * away. Chances are some of the tail pag= es can > + * be freed without IO. 
> + */ > + if (folio_test_large(folio) && > + atomic_read(&folio->_nr_pages_mapped)= < nr_pages) > + split_folio_to_list(folio, folio_= list); > if (!add_to_swap(folio)) { > if (!folio_test_large(folio)) > goto activate_locked_spli= t; > @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist,= struct scan_control *sc) > orig_mask =3D sc->gfp_mask; > if (buffer_heads_over_limit) { > sc->gfp_mask |=3D __GFP_HIGHMEM; > - sc->reclaim_idx =3D gfp_zone(sc->gfp_mask); > + sc->reclaim_idx =3D gfp_order_zone(sc->gfp_mask, sc->orde= r); > } > > for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zo= nelist, int order, > struct scan_control sc =3D { > .nr_to_reclaim =3D SWAP_CLUSTER_MAX, > .gfp_mask =3D current_gfp_context(gfp_mask), > - .reclaim_idx =3D gfp_zone(gfp_mask), > + .reclaim_idx =3D gfp_order_zone(gfp_mask, order), > .order =3D order, > .nodemask =3D nodemask, > .priority =3D DEF_PRIORITY, > @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_fl= ags, int order, > if (!cpuset_zone_allowed(zone, gfp_flags)) > return; > > + curr_idx =3D gfp_order_zone(gfp_flags, order); > + if (highest_zoneidx > curr_idx) > + highest_zoneidx =3D curr_idx; > + > pgdat =3D zone->zone_pgdat; > curr_idx =3D READ_ONCE(pgdat->kswapd_highest_zoneidx); > > @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat= , gfp_t gfp_mask, unsigned in > .may_writepage =3D !!(node_reclaim_mode & RECLAIM_WRITE), > .may_unmap =3D !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap =3D 1, > - .reclaim_idx =3D gfp_zone(gfp_mask), > + .reclaim_idx =3D gfp_order_zone(gfp_mask, order), > }; > unsigned long pflags; > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index db79935e4a54..adbd032e6a0f 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned= int order) > > #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_nor= mal", \ > TEXT_FOR_HIGHMEM(xx) xx "_movable= ", \ > + xx "_nosplit", xx "_nomerge", \ > TEXT_FOR_DEVICE(xx) > > const char * const vmstat_text[] =3D { > @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m,= pg_data_t *pgdat, > "\n spanned %lu" > "\n present %lu" > "\n managed %lu" > - "\n cma %lu", > + "\n cma %lu" > + "\n order %u", > zone_page_state(zone, NR_FREE_PAGES), > zone->watermark_boost, > min_wmark_pages(zone), > @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m,= pg_data_t *pgdat, > zone->spanned_pages, > zone->present_pages, > zone_managed_pages(zone), > - zone_cma_pages(zone)); > + zone_cma_pages(zone), > + zone->order); > > seq_printf(m, > "\n protection: (%ld", > -- > 2.44.0.rc1.240.g4c46232300-goog > >