From: Yang Shi <shy828301@gmail.com>
To: Yu Zhao <yuzhao@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
Jonathan Corbet <corbet@lwn.net>
Subject: Re: [Chapter One] THP zones: the use cases of policy zones
Date: Thu, 29 Feb 2024 15:31:36 -0800
Message-ID: <CAHbLzkrpA1TfyLsOcqZ01KdR4-SjXpGrTOeJ+UjzeR_-2Feagw@mail.gmail.com>
In-Reply-To: <20240229183436.4110845-2-yuzhao@google.com>
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
> memory.
> 2. The device zone provides interoperability between CPU and device
> memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.
I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios, if the order is > the min order of the
policy zones, the fallback order should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right?

And if all other zones are depleted, an allocation whose order is <
the min order won't fall back to the policy zones and will fail, just
like a non-movable allocation can't fall back to ZONE_MOVABLE even
though there is enough free memory in that zone, right?
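Something like the sketch below is how I read the zone selection, going
by the gfp_order_zone() added in this patch (the function name and the
comments are mine, just for illustration -- not proposing new code):

static enum zone_type thp_alloc_zone(gfp_t flags, int order)
{
	/*
	 * For THP allocations (__GFP_MOVABLE | __GFP_COMP), gfp_zone()
	 * returns LAST_VIRT_ZONE, i.e. ZONE_NOMERGE.
	 */
	enum zone_type zid = gfp_zone(flags);

	/* Orders other than the exact nomerge order skip ZONE_NOMERGE. */
	if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
		zid = ZONE_NOMERGE - 1;		/* ZONE_NOSPLIT */

	/* Orders below the nosplit minimum skip ZONE_NOSPLIT as well. */
	if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
		zid = ZONE_NOSPLIT - 1;		/* ZONE_MOVABLE */

	/*
	 * From here the zonelist is walked downward as usual, so an
	 * eligible THP allocation can still fall back to ZONE_MOVABLE
	 * and then ZONE_NORMAL (and below) when the THP zones are
	 * depleted.
	 */
	return zid;
}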
>
> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
>
> Implementing the same idea in the pageblock granularity has also been
> explored but rejected at Google. Pageblocks have a finer granularity
> and therefore can be more flexible than zones. The tradeoff is that
> this alternative implementation was more complex and failed to bring a
> better ROI. However, the rejection was mainly due to its inability to
> be smoothly extended to 1GB THPs [2], which is a planned use case of
> TAO.
>
> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> .../admin-guide/kernel-parameters.txt | 10 +
> drivers/virtio/virtio_mem.c | 2 +-
> include/linux/gfp.h | 24 +-
> include/linux/huge_mm.h | 6 -
> include/linux/mempolicy.h | 2 +-
> include/linux/mmzone.h | 52 +-
> include/linux/nodemask.h | 2 +-
> include/linux/vm_event_item.h | 2 +-
> include/trace/events/mmflags.h | 4 +-
> mm/compaction.c | 12 +
> mm/huge_memory.c | 5 +-
> mm/mempolicy.c | 14 +-
> mm/migrate.c | 7 +-
> mm/mm_init.c | 452 ++++++++++--------
> mm/page_alloc.c | 44 +-
> mm/page_isolation.c | 2 +-
> mm/swap_slots.c | 3 +-
> mm/vmscan.c | 32 +-
> mm/vmstat.c | 7 +-
> 19 files changed, 431 insertions(+), 251 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 31b3a25680d0..a6c181f6efde 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3529,6 +3529,16 @@
> allocations which rules out almost all kernel
> allocations. Use with caution!
>
> + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in
> + this zone can't be split down below order Y, while free
> + or in use.
> + Like movablecore, X should be either nn[KMGTPE] or n%.
> +
> + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in
> + this zone are always order Y, meaning they can't be
> + split or merged while free or in use.
> + Like movablecore, X should be either nn[KMGTPE] or n%.
> +
> MTD_Partition= [MTD]
> Format: <name>,<region-number>,<size>,<offset>
>
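Also, just to confirm my reading of the two new parameters above: with
4KB base pages, a command line like (sizes made up for illustration)

    nosplit=16G,4 nomerge=32G,9

would carve out a 16GB nosplit zone that never fragments below 64KB
(order 4) and a 32GB nomerge zone of fixed 2MB (order 9) blocks, and
the nomerge order has to be strictly higher than the nosplit order,
right?
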
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 8e3223294442..37ecf5ee4afd 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm,
> page = pfn_to_online_page(pfn);
> if (!page)
> continue;
> - if (page_zonenum(page) != ZONE_MOVABLE)
> + if (!is_zone_movable_page(page))
> return false;
> }
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index de292a007138..c0f9d21b4d18 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
> */
>
> -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> -/* ZONE_DEVICE is not a valid GFP zone specifier */
> +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4
> +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */
> #define GFP_ZONES_SHIFT 2
> #else
> #define GFP_ZONES_SHIFT ZONES_SHIFT
> @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags)
> z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> ((1 << GFP_ZONES_SHIFT) - 1);
> VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
> +
> + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP))
> + return LAST_VIRT_ZONE;
> +
> return z;
> }
>
> +extern int zone_nomerge_order __read_mostly;
> +extern int zone_nosplit_order __read_mostly;
> +
> +static inline enum zone_type gfp_order_zone(gfp_t flags, int order)
> +{
> + enum zone_type zid = gfp_zone(flags);
> +
> + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
> + zid = ZONE_NOMERGE - 1;
> +
> + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
> + zid = ZONE_NOSPLIT - 1;
> +
> + return zid;
> +}
> +
> /*
> * There is only one page-allocator function, and two main namespaces to
> * it. The alloc_page*() variants return 'struct page *' and as such
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..9960ad7c3b10 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> unsigned long len, unsigned long pgoff, unsigned long flags);
>
> void folio_prep_large_rmappable(struct folio *folio);
> -bool can_split_folio(struct folio *folio, int *pextra_pins);
> int split_huge_page_to_list(struct page *page, struct list_head *list);
> static inline int split_huge_page(struct page *page)
> {
> @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {}
>
> #define thp_get_unmapped_area NULL
>
> -static inline bool
> -can_split_folio(struct folio *folio, int *pextra_pins)
> -{
> - return false;
> -}
> static inline int
> split_huge_page_to_list(struct page *page, struct list_head *list)
> {
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..a92bcf47cf8c 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -150,7 +150,7 @@ extern enum zone_type policy_zone;
>
> static inline void check_highest_zone(enum zone_type k)
> {
> - if (k > policy_zone && k != ZONE_MOVABLE)
> + if (k > policy_zone && !zid_is_virt(k))
> policy_zone = k;
> }
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a497f189d988..532218167bba 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -805,11 +805,15 @@ enum zone_type {
> * there can be false negatives).
> */
> ZONE_MOVABLE,
> + ZONE_NOSPLIT,
> + ZONE_NOMERGE,
> #ifdef CONFIG_ZONE_DEVICE
> ZONE_DEVICE,
> #endif
> - __MAX_NR_ZONES
> + __MAX_NR_ZONES,
>
> + LAST_PHYS_ZONE = ZONE_MOVABLE - 1,
> + LAST_VIRT_ZONE = ZONE_NOMERGE,
> };
>
> #ifndef __GENERATING_BOUNDS_H
> @@ -929,6 +933,8 @@ struct zone {
> seqlock_t span_seqlock;
> #endif
>
> + int order;
> +
> int initialized;
>
> /* Write-intensive fields used from the page allocator */
> @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio)
>
> static inline bool is_zone_movable_page(const struct page *page)
> {
> - return page_zonenum(page) == ZONE_MOVABLE;
> + return page_zonenum(page) >= ZONE_MOVABLE;
> }
>
> static inline bool folio_is_zone_movable(const struct folio *folio)
> {
> - return folio_zonenum(folio) == ZONE_MOVABLE;
> + return folio_zonenum(folio) >= ZONE_MOVABLE;
> +}
> +
> +static inline bool page_can_split(struct page *page)
> +{
> + return page_zonenum(page) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool folio_can_split(struct folio *folio)
> +{
> + return folio_zonenum(folio) < ZONE_NOSPLIT;
> }
> #endif
>
> @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; };
> */
> #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
>
> +static inline bool zid_is_virt(enum zone_type zid)
> +{
> + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE;
> +}
> +
> +static inline bool zone_can_frag(struct zone *zone)
> +{
> + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT);
> +
> + return zone_idx(zone) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool zone_is_suitable(struct zone *zone, int order)
> +{
> + int zid = zone_idx(zone);
> +
> + if (zid < ZONE_NOSPLIT)
> + return true;
> +
> + if (!zone->order)
> + return false;
> +
> + return (zid == ZONE_NOSPLIT && order >= zone->order) ||
> + (zid == ZONE_NOMERGE && order == zone->order);
> +}
> +
> #ifdef CONFIG_ZONE_DEVICE
> static inline bool zone_is_zone_device(struct zone *zone)
> {
> @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone)
> static inline void zone_set_nid(struct zone *zone, int nid) {}
> #endif
>
> -extern int movable_zone;
> +extern int virt_zone;
>
> static inline int is_highmem_idx(enum zone_type idx)
> {
> #ifdef CONFIG_HIGHMEM
> return (idx == ZONE_HIGHMEM ||
> - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
> + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM));
> #else
> return 0;
> #endif
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index b61438313a73..34fbe910576d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -404,7 +404,7 @@ enum node_states {
> #else
> N_HIGH_MEMORY = N_NORMAL_MEMORY,
> #endif
> - N_MEMORY, /* The node has memory(regular, high, movable) */
> + N_MEMORY, /* The node has memory in any of the zones */
> N_CPU, /* The node has one or more cpus */
> N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */
> NR_NODE_STATES
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 747943bc8cc2..9a54d15d5ec3 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -27,7 +27,7 @@
> #endif
>
> #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
> - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
> + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx)
>
> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGALLOC)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index d801409b33cf..2b5fdafaadea 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
> IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \
> EM (ZONE_NORMAL, "Normal") \
> IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \
> - EMe(ZONE_MOVABLE,"Movable")
> + EM (ZONE_MOVABLE,"Movable") \
> + EM (ZONE_NOSPLIT,"NoSplit") \
> + EMe(ZONE_NOMERGE,"NoMerge")
>
> #define LRU_NAMES \
> EM (LRU_INACTIVE_ANON, "inactive_anon") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4add68d40e8d..8a64c805f411 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> ac->highest_zoneidx, ac->nodemask) {
> enum compact_result status;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> if (prio > MIN_COMPACT_PRIORITY
> && compaction_deferred(zone, order)) {
> rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> cc.zone = zone;
>
> compact_zone(&cc, NULL);
> @@ -2846,6 +2852,9 @@ static void compact_node(int nid)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> cc.zone = zone;
>
> compact_zone(&cc, NULL);
> @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> ret = compaction_suit_allocation_order(zone,
> pgdat->kcompactd_max_order,
> highest_zoneidx, ALLOC_WMARK_MIN);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c958f7ebb5..b57faa0a1e83 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> }
>
> /* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int *pextra_pins)
> +static bool can_split_folio(struct folio *folio, int *pextra_pins)
> {
> int extra_pins;
>
> + if (!folio_can_split(folio))
> + return false;
> +
> /* Additional pins from page cache */
> if (folio_test_anon(folio))
> extra_pins = folio_test_swapcache(folio) ?
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..1f84dd759086 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma)
>
> bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> {
> - enum zone_type dynamic_policy_zone = policy_zone;
> -
> - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
> + WARN_ON_ONCE(zid_is_virt(policy_zone));
>
> /*
> - * if policy->nodes has movable memory only,
> - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
> + * If policy->nodes has memory in virtual zones only, we apply policy
> + * only if gfp_zone(gfp) can allocate from those zones.
> *
> * policy->nodes is intersect with node_states[N_MEMORY].
> * so if the following test fails, it implies
> - * policy->nodes has movable memory only.
> + * policy->nodes has memory in virtual zones only.
> */
> if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
> - dynamic_policy_zone = ZONE_MOVABLE;
> + return zone > LAST_PHYS_ZONE;
>
> - return zone >= dynamic_policy_zone;
> + return zone >= policy_zone;
> }
>
> /* Do dynamic interleaving for a process */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cc9f2bcd73b4..f615c0c22046 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> {
> int rc;
>
> + if (!folio_can_split(folio))
> + return -EBUSY;
> +
> folio_lock(folio);
> rc = split_folio_to_list(folio, split_folios);
> folio_unlock(folio);
> @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
> order = folio_order(src);
> }
> zidx = zone_idx(folio_zone(src));
> - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> + if (zidx > ZONE_NORMAL)
> gfp_mask |= __GFP_HIGHMEM;
>
> return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
> @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
> break;
> }
> wakeup_kswapd(pgdat->node_zones + z, 0,
> - folio_order(folio), ZONE_MOVABLE);
> + folio_order(folio), z);
> return 0;
> }
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36..7769c21e6d54 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init);
>
> static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
> static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
> -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
>
> -static unsigned long required_kernelcore __initdata;
> -static unsigned long required_kernelcore_percent __initdata;
> -static unsigned long required_movablecore __initdata;
> -static unsigned long required_movablecore_percent __initdata;
> +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata;
> +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid])
> +
> +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE])
> +
> +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE])
> +
> +int zone_nosplit_order __read_mostly;
> +int zone_nomerge_order __read_mostly;
>
> static unsigned long nr_kernel_pages __initdata;
> static unsigned long nr_all_pages __initdata;
> @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p)
> return 0;
> }
>
> - return cmdline_parse_core(p, &required_kernelcore,
> - &required_kernelcore_percent);
> + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE),
> + &percentage_of(LAST_PHYS_ZONE));
> }
> early_param("kernelcore", cmdline_parse_kernelcore);
>
> @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore);
> */
> static int __init cmdline_parse_movablecore(char *p)
> {
> - return cmdline_parse_core(p, &required_movablecore,
> - &required_movablecore_percent);
> + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE),
> + &percentage_of(ZONE_MOVABLE));
> }
> early_param("movablecore", cmdline_parse_movablecore);
>
> +static int __init parse_zone_order(char *p, unsigned long *nr_pages,
> + unsigned long *percent, int *order)
> +{
> + int err;
> + unsigned long n;
> + char *s = strchr(p, ',');
> +
> + if (!s)
> + return -EINVAL;
> +
> + *s++ = '\0';
> +
> + err = kstrtoul(s, 0, &n);
> + if (err)
> + return err;
> +
> + if (n < 2 || n > MAX_PAGE_ORDER)
> + return -EINVAL;
> +
> + err = cmdline_parse_core(p, nr_pages, percent);
> + if (err)
> + return err;
> +
> + *order = n;
> +
> + return 0;
> +}
> +
> +static int __init parse_zone_nosplit(char *p)
> +{
> + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT),
> + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order);
> +}
> +early_param("nosplit", parse_zone_nosplit);
> +
> +static int __init parse_zone_nomerge(char *p)
> +{
> + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE),
> + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order);
> +}
> +early_param("nomerge", parse_zone_nomerge);
> +
> /*
> * early_calculate_totalpages()
> - * Sum pages in active regions for movable zone.
> + * Sum pages in active regions for virtual zones.
> * Populate N_MEMORY for calculating usable_nodes.
> */
> static unsigned long __init early_calculate_totalpages(void)
> @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void)
> }
>
> /*
> - * This finds a zone that can be used for ZONE_MOVABLE pages. The
> + * This finds a physical zone that can be used for virtual zones. The
> * assumption is made that zones within a node are ordered in monotonic
> * increasing memory addresses so that the "highest" populated zone is used
> */
> -static void __init find_usable_zone_for_movable(void)
> +static void __init find_usable_zone(void)
> {
> int zone_index;
> - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
> - if (zone_index == ZONE_MOVABLE)
> - continue;
> -
> + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) {
> if (arch_zone_highest_possible_pfn[zone_index] >
> arch_zone_lowest_possible_pfn[zone_index])
> break;
> }
>
> VM_BUG_ON(zone_index == -1);
> - movable_zone = zone_index;
> + virt_zone = zone_index;
> +}
> +
> +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn)
> +{
> + int i, nid;
> + unsigned long node_avg, remaining;
> + int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> + /* usable_startpfn is the lowest possible pfn virtual zones can be at */
> + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone];
> +
> +restart:
> + /* Carve out memory as evenly as possible throughout nodes */
> + node_avg = occupied / usable_nodes;
> + for_each_node_state(nid, N_MEMORY) {
> + unsigned long start_pfn, end_pfn;
> +
> + /*
> + * Recalculate node_avg if the division per node now exceeds
> + * what is necessary to satisfy the amount of memory to carve
> + * out.
> + */
> + if (occupied < node_avg)
> + node_avg = occupied / usable_nodes;
> +
> + /*
> + * As the map is walked, we track how much memory is usable
> + * using remaining. When it is 0, the rest of the node is
> + * usable.
> + */
> + remaining = node_avg;
> +
> + /* Go through each range of PFNs within this node */
> + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> + unsigned long size_pages;
> +
> + start_pfn = max(start_pfn, zone_pfn[nid]);
> + if (start_pfn >= end_pfn)
> + continue;
> +
> + /* Account for what is only usable when carving out */
> + if (start_pfn < usable_startpfn) {
> + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn;
> +
> + remaining -= min(nr_pages, remaining);
> + occupied -= min(nr_pages, occupied);
> +
> + /* Continue if range is now fully accounted */
> + if (end_pfn <= usable_startpfn) {
> +
> + /*
> + * Push zone_pfn to the end so that if
> + * we have to carve out more across
> + * nodes, we will not double account
> + * here.
> + */
> + zone_pfn[nid] = end_pfn;
> + continue;
> + }
> + start_pfn = usable_startpfn;
> + }
> +
> + /*
> + * The usable PFN range is from start_pfn->end_pfn.
> + * Calculate size_pages as the number of pages used.
> + */
> + size_pages = end_pfn - start_pfn;
> + if (size_pages > remaining)
> + size_pages = remaining;
> + zone_pfn[nid] = start_pfn + size_pages;
> +
> + /*
> + * Some memory was carved out, update counts and break
> + * if the request for this node has been satisfied.
> + */
> + occupied -= min(occupied, size_pages);
> + remaining -= size_pages;
> + if (!remaining)
> + break;
> + }
> + }
> +
> + /*
> + * If there is still more to carve out, we do another pass with one less
> + * node in the count. This will push zone_pfn[nid] further along on the
> + * nodes that still have memory until the request is fully satisfied.
> + */
> + usable_nodes--;
> + if (usable_nodes && occupied > usable_nodes)
> + goto restart;
> }
>
> /*
> @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void)
> * memory. When they don't, some nodes will have more kernelcore than
> * others
> */
> -static void __init find_zone_movable_pfns_for_nodes(void)
> +static void __init find_virt_zones(void)
> {
> - int i, nid;
> + int i;
> + int nid;
> unsigned long usable_startpfn;
> - unsigned long kernelcore_node, kernelcore_remaining;
> /* save the state before borrow the nodemask */
> nodemask_t saved_node_state = node_states[N_MEMORY];
> unsigned long totalpages = early_calculate_totalpages();
> - int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> struct memblock_region *r;
> + unsigned long occupied = 0;
>
> - /* Need to find movable_zone earlier when movable_node is specified. */
> - find_usable_zone_for_movable();
> + /* Need to find virt_zone earlier when movable_node is specified. */
> + find_usable_zone();
>
> /*
> * If movable_node is specified, ignore kernelcore and movablecore
> @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> nid = memblock_get_region_node(r);
>
> usable_startpfn = PFN_DOWN(r->base);
> - zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> - min(usable_startpfn, zone_movable_pfn[nid]) :
> + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
> usable_startpfn;
> }
>
> @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> continue;
> }
>
> - zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> - min(usable_startpfn, zone_movable_pfn[nid]) :
> + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
> usable_startpfn;
> }
>
> @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> goto out2;
> }
>
> + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) {
> + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0;
> + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0;
> + zone_nosplit_order = zone_nomerge_order = 0;
> + pr_warn("zone %s order %d must be higher zone %s order %d\n",
> + zone_names[ZONE_NOMERGE], zone_nomerge_order,
> + zone_names[ZONE_NOSPLIT], zone_nosplit_order);
> + }
> +
> /*
> * If kernelcore=nn% or movablecore=nn% was specified, calculate the
> * amount of necessary memory.
> */
> - if (required_kernelcore_percent)
> - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
> - 10000UL;
> - if (required_movablecore_percent)
> - required_movablecore = (totalpages * 100 * required_movablecore_percent) /
> - 10000UL;
> + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) {
> + if (percentage_of(i))
> + nr_pages_of(i) = totalpages * percentage_of(i) / 100;
> +
> + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES);
> + occupied += nr_pages_of(i);
> + }
>
> /*
> * If movablecore= was specified, calculate what size of
> * kernelcore that corresponds so that memory usable for
> * any allocation type is evenly spread. If both kernelcore
> * and movablecore are specified, then the value of kernelcore
> - * will be used for required_kernelcore if it's greater than
> - * what movablecore would have allowed.
> + * will be used if it's greater than what movablecore would have
> + * allowed.
> */
> - if (required_movablecore) {
> - unsigned long corepages;
> + if (occupied < totalpages) {
> + enum zone_type zid;
>
> - /*
> - * Round-up so that ZONE_MOVABLE is at least as large as what
> - * was requested by the user
> - */
> - required_movablecore =
> - roundup(required_movablecore, MAX_ORDER_NR_PAGES);
> - required_movablecore = min(totalpages, required_movablecore);
> - corepages = totalpages - required_movablecore;
> -
> - required_kernelcore = max(required_kernelcore, corepages);
> + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ?
> + LAST_PHYS_ZONE : ZONE_MOVABLE;
> + nr_pages_of(zid) += totalpages - occupied;
> }
>
> /*
> * If kernelcore was not specified or kernelcore size is larger
> - * than totalpages, there is no ZONE_MOVABLE.
> + * than totalpages, there are not virtual zones.
> */
> - if (!required_kernelcore || required_kernelcore >= totalpages)
> + occupied = nr_pages_of(LAST_PHYS_ZONE);
> + if (!occupied || occupied >= totalpages)
> goto out;
>
> - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
> - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
> + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> + if (!nr_pages_of(i))
> + continue;
>
> -restart:
> - /* Spread kernelcore memory as evenly as possible throughout nodes */
> - kernelcore_node = required_kernelcore / usable_nodes;
> - for_each_node_state(nid, N_MEMORY) {
> - unsigned long start_pfn, end_pfn;
> -
> - /*
> - * Recalculate kernelcore_node if the division per node
> - * now exceeds what is necessary to satisfy the requested
> - * amount of memory for the kernel
> - */
> - if (required_kernelcore < kernelcore_node)
> - kernelcore_node = required_kernelcore / usable_nodes;
> -
> - /*
> - * As the map is walked, we track how much memory is usable
> - * by the kernel using kernelcore_remaining. When it is
> - * 0, the rest of the node is usable by ZONE_MOVABLE
> - */
> - kernelcore_remaining = kernelcore_node;
> -
> - /* Go through each range of PFNs within this node */
> - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> - unsigned long size_pages;
> -
> - start_pfn = max(start_pfn, zone_movable_pfn[nid]);
> - if (start_pfn >= end_pfn)
> - continue;
> -
> - /* Account for what is only usable for kernelcore */
> - if (start_pfn < usable_startpfn) {
> - unsigned long kernel_pages;
> - kernel_pages = min(end_pfn, usable_startpfn)
> - - start_pfn;
> -
> - kernelcore_remaining -= min(kernel_pages,
> - kernelcore_remaining);
> - required_kernelcore -= min(kernel_pages,
> - required_kernelcore);
> -
> - /* Continue if range is now fully accounted */
> - if (end_pfn <= usable_startpfn) {
> -
> - /*
> - * Push zone_movable_pfn to the end so
> - * that if we have to rebalance
> - * kernelcore across nodes, we will
> - * not double account here
> - */
> - zone_movable_pfn[nid] = end_pfn;
> - continue;
> - }
> - start_pfn = usable_startpfn;
> - }
> -
> - /*
> - * The usable PFN range for ZONE_MOVABLE is from
> - * start_pfn->end_pfn. Calculate size_pages as the
> - * number of pages used as kernelcore
> - */
> - size_pages = end_pfn - start_pfn;
> - if (size_pages > kernelcore_remaining)
> - size_pages = kernelcore_remaining;
> - zone_movable_pfn[nid] = start_pfn + size_pages;
> -
> - /*
> - * Some kernelcore has been met, update counts and
> - * break if the kernelcore for this node has been
> - * satisfied
> - */
> - required_kernelcore -= min(required_kernelcore,
> - size_pages);
> - kernelcore_remaining -= size_pages;
> - if (!kernelcore_remaining)
> - break;
> - }
> + find_virt_zone(occupied, &pfn_of(i, 0));
> + occupied += nr_pages_of(i);
> }
> -
> - /*
> - * If there is still required_kernelcore, we do another pass with one
> - * less node in the count. This will push zone_movable_pfn[nid] further
> - * along on the nodes that still have memory until kernelcore is
> - * satisfied
> - */
> - usable_nodes--;
> - if (usable_nodes && required_kernelcore > usable_nodes)
> - goto restart;
> -
> out2:
> - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
> + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */
> for (nid = 0; nid < MAX_NUMNODES; nid++) {
> unsigned long start_pfn, end_pfn;
> -
> - zone_movable_pfn[nid] =
> - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
> + unsigned long prev_virt_zone_pfn = 0;
>
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> - if (zone_movable_pfn[nid] >= end_pfn)
> - zone_movable_pfn[nid] = 0;
> +
> + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES);
> +
> + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn)
> + pfn_of(i, nid) = 0;
> +
> + if (pfn_of(i, nid))
> + prev_virt_zone_pfn = pfn_of(i, nid);
> + }
> }
> -
> out:
> /* restore the node_state */
> node_states[N_MEMORY] = saved_node_state;
> @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone,
> #endif
>
> /*
> - * The zone ranges provided by the architecture do not include ZONE_MOVABLE
> - * because it is sized independent of architecture. Unlike the other zones,
> - * the starting point for ZONE_MOVABLE is not fixed. It may be different
> - * in each node depending on the size of each node and how evenly kernelcore
> - * is distributed. This helper function adjusts the zone ranges
> + * The zone ranges provided by the architecture do not include virtual zones
> + * because they are sized independent of architecture. Unlike physical zones,
> + * the starting point for the first populated virtual zone is not fixed. It may
> + * be different in each node depending on the size of each node and how evenly
> + * kernelcore is distributed. This helper function adjusts the zone ranges
> * provided by the architecture for a given node by using the end of the
> - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
> - * zones within a node are in order of monotonic increases memory addresses
> + * highest usable zone for the first populated virtual zone. This preserves the
> + * assumption that zones within a node are in order of monotonic increases
> + * memory addresses.
> */
> -static void __init adjust_zone_range_for_zone_movable(int nid,
> +static void __init adjust_zone_range(int nid,
> unsigned long zone_type,
> unsigned long node_end_pfn,
> unsigned long *zone_start_pfn,
> unsigned long *zone_end_pfn)
> {
> - /* Only adjust if ZONE_MOVABLE is on this node */
> - if (zone_movable_pfn[nid]) {
> - /* Size ZONE_MOVABLE */
> - if (zone_type == ZONE_MOVABLE) {
> - *zone_start_pfn = zone_movable_pfn[nid];
> - *zone_end_pfn = min(node_end_pfn,
> - arch_zone_highest_possible_pfn[movable_zone]);
> + int i = max_t(int, zone_type, LAST_PHYS_ZONE);
> + unsigned long next_virt_zone_pfn = 0;
>
> - /* Adjust for ZONE_MOVABLE starting within this range */
> - } else if (!mirrored_kernelcore &&
> - *zone_start_pfn < zone_movable_pfn[nid] &&
> - *zone_end_pfn > zone_movable_pfn[nid]) {
> - *zone_end_pfn = zone_movable_pfn[nid];
> + while (i++ < LAST_VIRT_ZONE) {
> + if (pfn_of(i, nid)) {
> + next_virt_zone_pfn = pfn_of(i, nid);
> + break;
> + }
> + }
>
> - /* Check if this whole range is within ZONE_MOVABLE */
> - } else if (*zone_start_pfn >= zone_movable_pfn[nid])
> + if (zone_type <= LAST_PHYS_ZONE) {
> + if (!next_virt_zone_pfn)
> + return;
> +
> + if (!mirrored_kernelcore &&
> + *zone_start_pfn < next_virt_zone_pfn &&
> + *zone_end_pfn > next_virt_zone_pfn)
> + *zone_end_pfn = next_virt_zone_pfn;
> + else if (*zone_start_pfn >= next_virt_zone_pfn)
> *zone_start_pfn = *zone_end_pfn;
> + } else if (zone_type <= LAST_VIRT_ZONE) {
> + if (!pfn_of(zone_type, nid))
> + return;
> +
> + if (next_virt_zone_pfn)
> + *zone_end_pfn = min3(next_virt_zone_pfn,
> + node_end_pfn,
> + arch_zone_highest_possible_pfn[virt_zone]);
> + else
> + *zone_end_pfn = min(node_end_pfn,
> + arch_zone_highest_possible_pfn[virt_zone]);
> + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid));
> }
> }
>
> @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
> * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
> * and vice versa.
> */
> - if (mirrored_kernelcore && zone_movable_pfn[nid]) {
> + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) {
> unsigned long start_pfn, end_pfn;
> struct memblock_region *r;
>
> @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
> /* Get the start and end of the zone */
> *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
> *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
> - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn,
> - zone_start_pfn, zone_end_pfn);
> + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn);
>
> /* Check that this node has pages within the zone's required range */
> if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
> @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> #if defined(CONFIG_MEMORY_HOTPLUG)
> zone->present_early_pages = real_size;
> #endif
> + if (i == ZONE_NOSPLIT)
> + zone->order = zone_nosplit_order;
> + if (i == ZONE_NOMERGE)
> + zone->order = zone_nomerge_order;
>
> totalpages += spanned;
> realtotalpages += real_size;
> @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat)
> {
> enum zone_type zone_type;
>
> - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
> + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) {
> struct zone *zone = &pgdat->node_zones[zone_type];
> if (populated_zone(zone)) {
> if (IS_ENABLED(CONFIG_HIGHMEM))
> @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void)
> void __init free_area_init(unsigned long *max_zone_pfn)
> {
> unsigned long start_pfn, end_pfn;
> - int i, nid, zone;
> + int i, j, nid, zone;
> bool descending;
>
> /* Record where the zone boundaries are */
> @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> start_pfn = PHYS_PFN(memblock_start_of_DRAM());
> descending = arch_has_descending_max_zone_pfns();
>
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> + for (i = 0; i <= LAST_PHYS_ZONE; i++) {
> if (descending)
> - zone = MAX_NR_ZONES - i - 1;
> + zone = LAST_PHYS_ZONE - i;
> else
> zone = i;
>
> - if (zone == ZONE_MOVABLE)
> - continue;
> -
> end_pfn = max(max_zone_pfn[zone], start_pfn);
> arch_zone_lowest_possible_pfn[zone] = start_pfn;
> arch_zone_highest_possible_pfn[zone] = end_pfn;
> @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> start_pfn = end_pfn;
> }
>
> - /* Find the PFNs that ZONE_MOVABLE begins at in each node */
> - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
> - find_zone_movable_pfns_for_nodes();
> + /* Find the PFNs that virtual zones begin at in each node */
> + find_virt_zones();
>
> /* Print out the zone ranges */
> pr_info("Zone ranges:\n");
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> - if (i == ZONE_MOVABLE)
> - continue;
> + for (i = 0; i <= LAST_PHYS_ZONE; i++) {
> pr_info(" %-8s ", zone_names[i]);
> if (arch_zone_lowest_possible_pfn[i] ==
> arch_zone_highest_possible_pfn[i])
> @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> << PAGE_SHIFT) - 1);
> }
>
> - /* Print out the PFNs ZONE_MOVABLE begins at in each node */
> - pr_info("Movable zone start for each node\n");
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - if (zone_movable_pfn[i])
> - pr_info(" Node %d: %#018Lx\n", i,
> - (u64)zone_movable_pfn[i] << PAGE_SHIFT);
> + /* Print out the PFNs virtual zones begin at in each node */
> + for (; i <= LAST_VIRT_ZONE; i++) {
> + pr_info("%s zone start for each node\n", zone_names[i]);
> + for (j = 0; j < MAX_NUMNODES; j++) {
> + if (pfn_of(i, j))
> + pr_info(" Node %d: %#018Lx\n",
> + j, (u64)pfn_of(i, j) << PAGE_SHIFT);
> + }
> }
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 150d4f23b010..6a4da8f8691c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = {
> "HighMem",
> #endif
> "Movable",
> + "NoSplit",
> + "NoMerge",
> #ifdef CONFIG_ZONE_DEVICE
> "Device",
> #endif
> @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1;
> static int watermark_boost_factor __read_mostly = 15000;
> static int watermark_scale_factor = 10;
>
> -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> -int movable_zone;
> -EXPORT_SYMBOL(movable_zone);
> +/* virt_zone is the "real" zone pages in virtual zones are taken from */
> +int virt_zone;
> +EXPORT_SYMBOL(virt_zone);
>
> #if MAX_NUMNODES > 1
> unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> unsigned long higher_page_pfn;
> struct page *higher_page;
>
> - if (order >= MAX_PAGE_ORDER - 1)
> - return false;
> -
> higher_page_pfn = buddy_pfn & pfn;
> higher_page = page + (higher_page_pfn - pfn);
>
> @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> NULL) != NULL;
> }
>
> +static int zone_max_order(struct zone *zone)
> +{
> + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER;
> +}
> +
> /*
> * Freeing function for a buddy system allocator.
> *
> @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page,
> unsigned long combined_pfn;
> struct page *buddy;
> bool to_tail;
> + int max_order = zone_max_order(zone);
>
> VM_BUG_ON(!zone_is_initialized(zone));
> VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> - while (order < MAX_PAGE_ORDER) {
> + while (order < max_order) {
> if (compaction_capture(capc, page, order, migratetype)) {
> __mod_zone_freepage_state(zone, -(1 << order),
> migratetype);
> @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page,
> to_tail = true;
> else if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> + else if (order + 1 >= max_order)
> + to_tail = false;
> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
> @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page,
> int mt;
> int ret = 0;
>
> + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page);
> +
> if (split_pfn_offset == 0)
> return ret;
>
> @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> struct free_area *area;
> struct page *page;
>
> + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order));
> +
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
> area = &(zone->free_area[current_order]);
> @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> long min = mark;
> int o;
>
> + if (!zone_is_suitable(z, order))
> + return false;
> +
> /* free_pages may go negative - that's OK */
> free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>
> @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> {
> long free_pages;
>
> + if (!zone_is_suitable(z, order))
> + return false;
> +
> free_pages = zone_page_state(z, NR_FREE_PAGES);
>
> /*
> @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> struct page *page;
> unsigned long mark;
>
> + if (!zone_is_suitable(zone, order))
> + continue;
> +
> if (cpusets_enabled() &&
> (alloc_flags & ALLOC_CPUSET) &&
> !__cpuset_zone_allowed(zone, gfp_mask))
> @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
> struct zone *zone;
> unsigned long flags;
>
> - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
> + /* Calculate total number of pages below ZONE_HIGHMEM */
> for_each_zone(zone) {
> - if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
> + if (zone_idx(zone) <= ZONE_NORMAL)
> lowmem_pages += zone_managed_pages(zone);
> }
>
> @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
> spin_lock_irqsave(&zone->lock, flags);
> tmp = (u64)pages_min * zone_managed_pages(zone);
> do_div(tmp, lowmem_pages);
> - if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
> + if (zone_idx(zone) > ZONE_NORMAL) {
> /*
> * __GFP_HIGH and PF_MEMALLOC allocations usually don't
> - * need highmem and movable zones pages, so cap pages_min
> - * to a small value here.
> + * need pages from zones above ZONE_NORMAL, so cap
> + * pages_min to a small value here.
> *
> * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
> * deltas control async page reclaim, and so should
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index cd0ea3668253..8a6473543427 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> * pages then it should be reasonably safe to assume the rest
> * is movable.
> */
> - if (zone_idx(zone) == ZONE_MOVABLE)
> + if (zid_is_virt(zone_idx(zone)))
> continue;
>
> /*
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..ad0db0373b05 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> entry.val = 0;
>
> if (folio_test_large(folio)) {
> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
> + folio_test_pmd_mappable(folio))
> get_swap_pages(1, &entry, folio_nr_pages(folio));
> goto out;
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..ae061ec4866a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> goto keep_locked;
> if (folio_maybe_dma_pinned(folio))
> goto keep_locked;
> - if (folio_test_large(folio)) {
> - /* cannot split folio, skip it */
> - if (!can_split_folio(folio, NULL))
> - goto activate_locked;
> - /*
> - * Split folios without a PMD map right
> - * away. Chances are some or all of the
> - * tail pages can be freed without IO.
> - */
> - if (!folio_entire_mapcount(folio) &&
> - split_folio_to_list(folio,
> - folio_list))
> - goto activate_locked;
> - }
> + /*
> + * Split folios that are not fully map right
> + * away. Chances are some of the tail pages can
> + * be freed without IO.
> + */
> + if (folio_test_large(folio) &&
> + atomic_read(&folio->_nr_pages_mapped) < nr_pages)
> + split_folio_to_list(folio, folio_list);
> if (!add_to_swap(folio)) {
> if (!folio_test_large(folio))
> goto activate_locked_split;
> @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> orig_mask = sc->gfp_mask;
> if (buffer_heads_over_limit) {
> sc->gfp_mask |= __GFP_HIGHMEM;
> - sc->reclaim_idx = gfp_zone(sc->gfp_mask);
> + sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
> }
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> struct scan_control sc = {
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .gfp_mask = current_gfp_context(gfp_mask),
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_order_zone(gfp_mask, order),
> .order = order,
> .nodemask = nodemask,
> .priority = DEF_PRIORITY,
> @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> if (!cpuset_zone_allowed(zone, gfp_flags))
> return;
>
> + curr_idx = gfp_order_zone(gfp_flags, order);
> + if (highest_zoneidx > curr_idx)
> + highest_zoneidx = curr_idx;
> +
> pgdat = zone->zone_pgdat;
> curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
>
> @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> .may_swap = 1,
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_order_zone(gfp_mask, order),
> };
> unsigned long pflags;
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index db79935e4a54..adbd032e6a0f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>
> #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
> TEXT_FOR_HIGHMEM(xx) xx "_movable", \
> + xx "_nosplit", xx "_nomerge", \
> TEXT_FOR_DEVICE(xx)
>
> const char * const vmstat_text[] = {
> @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> "\n spanned %lu"
> "\n present %lu"
> "\n managed %lu"
> - "\n cma %lu",
> + "\n cma %lu"
> + "\n order %u",
> zone_page_state(zone, NR_FREE_PAGES),
> zone->watermark_boost,
> min_wmark_pages(zone),
> @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> zone->spanned_pages,
> zone->present_pages,
> zone_managed_pages(zone),
> - zone_cma_pages(zone));
> + zone_cma_pages(zone),
> + zone->order);
>
> seq_printf(m,
> "\n protection: (%ld",
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>