From: Yang Shi <shy828301@gmail.com>
To: Yu Zhao <yuzhao@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
Jonathan Corbet <corbet@lwn.net>
Subject: Re: [Chapter One] THP zones: the use cases of policy zones
Date: Thu, 29 Feb 2024 15:31:36 -0800
Message-ID: <CAHbLzkrpA1TfyLsOcqZ01KdR4-SjXpGrTOeJ+UjzeR_-2Feagw@mail.gmail.com>
In-Reply-To: <20240229183436.4110845-2-yuzhao@google.com>
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
> memory.
> 2. The device zone provides interoperability between CPU and device
> memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.
I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios, if the order is > the min order of the
policy zones, the fallback order should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right?

And if all other zones are depleted, an allocation whose order is <
the min order won't fall back to the policy zones and will fail, just
like a non-movable allocation can't fall back to ZONE_MOVABLE even
though there is enough free memory in that zone, right?
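Something like the sketch below is how I read the zone selection, going
by the gfp_order_zone() added in this patch (the function name and the
comments are mine, just for illustration -- not proposing new code):

static enum zone_type thp_alloc_zone(gfp_t flags, int order)
{
	/*
	 * For THP allocations (__GFP_MOVABLE | __GFP_COMP), gfp_zone()
	 * returns LAST_VIRT_ZONE, i.e. ZONE_NOMERGE.
	 */
	enum zone_type zid = gfp_zone(flags);

	/* Orders other than the exact nomerge order skip ZONE_NOMERGE. */
	if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
		zid = ZONE_NOMERGE - 1;		/* ZONE_NOSPLIT */

	/* Orders below the nosplit minimum skip ZONE_NOSPLIT as well. */
	if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
		zid = ZONE_NOSPLIT - 1;		/* ZONE_MOVABLE */

	/*
	 * From here the zonelist is walked downward as usual, so an
	 * eligible THP allocation can still fall back to ZONE_MOVABLE
	 * and then ZONE_NORMAL (and below) when the THP zones are
	 * depleted.
	 */
	return zid;
}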
>
> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
>
> Implementing the same idea in the pageblock granularity has also been
> explored but rejected at Google. Pageblocks have a finer granularity
> and therefore can be more flexible than zones. The tradeoff is that
> this alternative implementation was more complex and failed to bring a
> better ROI. However, the rejection was mainly due to its inability to
> be smoothly extended to 1GB THPs [2], which is a planned use case of
> TAO.
>
> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> .../admin-guide/kernel-parameters.txt | 10 +
> drivers/virtio/virtio_mem.c | 2 +-
> include/linux/gfp.h | 24 +-
> include/linux/huge_mm.h | 6 -
> include/linux/mempolicy.h | 2 +-
> include/linux/mmzone.h | 52 +-
> include/linux/nodemask.h | 2 +-
> include/linux/vm_event_item.h | 2 +-
> include/trace/events/mmflags.h | 4 +-
> mm/compaction.c | 12 +
> mm/huge_memory.c | 5 +-
> mm/mempolicy.c | 14 +-
> mm/migrate.c | 7 +-
> mm/mm_init.c | 452 ++++++++++--------
> mm/page_alloc.c | 44 +-
> mm/page_isolation.c | 2 +-
> mm/swap_slots.c | 3 +-
> mm/vmscan.c | 32 +-
> mm/vmstat.c | 7 +-
> 19 files changed, 431 insertions(+), 251 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 31b3a25680d0..a6c181f6efde 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3529,6 +3529,16 @@
> allocations which rules out almost all kernel
> allocations. Use with caution!
>
> + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in
> + this zone can't be split down below order Y, while free
> + or in use.
> + Like movablecore, X should be either nn[KMGTPE] or n%.
> +
> + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in
> + this zone are always order Y, meaning they can't be
> + split or merged while free or in use.
> + Like movablecore, X should be either nn[KMGTPE] or n%.
> +
> MTD_Partition= [MTD]
> Format: <name>,<region-number>,<size>,<offset>
>
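Also, just to confirm my reading of the two new parameters above: with
4KB base pages, a command line like (sizes made up for illustration)

    nosplit=16G,4 nomerge=32G,9

would carve out a 16GB nosplit zone that never fragments below 64KB
(order 4) and a 32GB nomerge zone of fixed 2MB (order 9) blocks, and
the nomerge order has to be strictly higher than the nosplit order,
right?
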
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 8e3223294442..37ecf5ee4afd 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm,
> page = pfn_to_online_page(pfn);
> if (!page)
> continue;
> - if (page_zonenum(page) != ZONE_MOVABLE)
> + if (!is_zone_movable_page(page))
> return false;
> }
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index de292a007138..c0f9d21b4d18 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
> */
>
> -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
> -/* ZONE_DEVICE is not a valid GFP zone specifier */
> +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4
> +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */
> #define GFP_ZONES_SHIFT 2
> #else
> #define GFP_ZONES_SHIFT ZONES_SHIFT
> @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags)
> z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
> ((1 << GFP_ZONES_SHIFT) - 1);
> VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
> +
> + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP))
> + return LAST_VIRT_ZONE;
> +
> return z;
> }
>
> +extern int zone_nomerge_order __read_mostly;
> +extern int zone_nosplit_order __read_mostly;
> +
> +static inline enum zone_type gfp_order_zone(gfp_t flags, int order)
> +{
> + enum zone_type zid = gfp_zone(flags);
> +
> + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
> + zid = ZONE_NOMERGE - 1;
> +
> + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
> + zid = ZONE_NOSPLIT - 1;
> +
> + return zid;
> +}
> +
> /*
> * There is only one page-allocator function, and two main namespaces to
> * it. The alloc_page*() variants return 'struct page *' and as such
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..9960ad7c3b10 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> unsigned long len, unsigned long pgoff, unsigned long flags);
>
> void folio_prep_large_rmappable(struct folio *folio);
> -bool can_split_folio(struct folio *folio, int *pextra_pins);
> int split_huge_page_to_list(struct page *page, struct list_head *list);
> static inline int split_huge_page(struct page *page)
> {
> @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {}
>
> #define thp_get_unmapped_area NULL
>
> -static inline bool
> -can_split_folio(struct folio *folio, int *pextra_pins)
> -{
> - return false;
> -}
> static inline int
> split_huge_page_to_list(struct page *page, struct list_head *list)
> {
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..a92bcf47cf8c 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -150,7 +150,7 @@ extern enum zone_type policy_zone;
>
> static inline void check_highest_zone(enum zone_type k)
> {
> - if (k > policy_zone && k != ZONE_MOVABLE)
> + if (k > policy_zone && !zid_is_virt(k))
> policy_zone = k;
> }
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a497f189d988..532218167bba 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -805,11 +805,15 @@ enum zone_type {
> * there can be false negatives).
> */
> ZONE_MOVABLE,
> + ZONE_NOSPLIT,
> + ZONE_NOMERGE,
> #ifdef CONFIG_ZONE_DEVICE
> ZONE_DEVICE,
> #endif
> - __MAX_NR_ZONES
> + __MAX_NR_ZONES,
>
> + LAST_PHYS_ZONE = ZONE_MOVABLE - 1,
> + LAST_VIRT_ZONE = ZONE_NOMERGE,
> };
>
> #ifndef __GENERATING_BOUNDS_H
> @@ -929,6 +933,8 @@ struct zone {
> seqlock_t span_seqlock;
> #endif
>
> + int order;
> +
> int initialized;
>
> /* Write-intensive fields used from the page allocator */
> @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio)
>
> static inline bool is_zone_movable_page(const struct page *page)
> {
> - return page_zonenum(page) == ZONE_MOVABLE;
> + return page_zonenum(page) >= ZONE_MOVABLE;
> }
>
> static inline bool folio_is_zone_movable(const struct folio *folio)
> {
> - return folio_zonenum(folio) == ZONE_MOVABLE;
> + return folio_zonenum(folio) >= ZONE_MOVABLE;
> +}
> +
> +static inline bool page_can_split(struct page *page)
> +{
> + return page_zonenum(page) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool folio_can_split(struct folio *folio)
> +{
> + return folio_zonenum(folio) < ZONE_NOSPLIT;
> }
> #endif
>
> @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; };
> */
> #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
>
> +static inline bool zid_is_virt(enum zone_type zid)
> +{
> + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE;
> +}
> +
> +static inline bool zone_can_frag(struct zone *zone)
> +{
> + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT);
> +
> + return zone_idx(zone) < ZONE_NOSPLIT;
> +}
> +
> +static inline bool zone_is_suitable(struct zone *zone, int order)
> +{
> + int zid = zone_idx(zone);
> +
> + if (zid < ZONE_NOSPLIT)
> + return true;
> +
> + if (!zone->order)
> + return false;
> +
> + return (zid == ZONE_NOSPLIT && order >= zone->order) ||
> + (zid == ZONE_NOMERGE && order == zone->order);
> +}
> +
> #ifdef CONFIG_ZONE_DEVICE
> static inline bool zone_is_zone_device(struct zone *zone)
> {
> @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone)
> static inline void zone_set_nid(struct zone *zone, int nid) {}
> #endif
>
> -extern int movable_zone;
> +extern int virt_zone;
>
> static inline int is_highmem_idx(enum zone_type idx)
> {
> #ifdef CONFIG_HIGHMEM
> return (idx == ZONE_HIGHMEM ||
> - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
> + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM));
> #else
> return 0;
> #endif
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index b61438313a73..34fbe910576d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -404,7 +404,7 @@ enum node_states {
> #else
> N_HIGH_MEMORY = N_NORMAL_MEMORY,
> #endif
> - N_MEMORY, /* The node has memory(regular, high, movable) */
> + N_MEMORY, /* The node has memory in any of the zones */
> N_CPU, /* The node has one or more cpus */
> N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */
> NR_NODE_STATES
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 747943bc8cc2..9a54d15d5ec3 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -27,7 +27,7 @@
> #endif
>
> #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
> - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)
> + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx)
>
> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGALLOC)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index d801409b33cf..2b5fdafaadea 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
> IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \
> EM (ZONE_NORMAL, "Normal") \
> IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \
> - EMe(ZONE_MOVABLE,"Movable")
> + EM (ZONE_MOVABLE,"Movable") \
> + EM (ZONE_NOSPLIT,"NoSplit") \
> + EMe(ZONE_NOMERGE,"NoMerge")
>
> #define LRU_NAMES \
> EM (LRU_INACTIVE_ANON, "inactive_anon") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4add68d40e8d..8a64c805f411 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> ac->highest_zoneidx, ac->nodemask) {
> enum compact_result status;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> if (prio > MIN_COMPACT_PRIORITY
> && compaction_deferred(zone, order)) {
> rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> cc.zone = zone;
>
> compact_zone(&cc, NULL);
> @@ -2846,6 +2852,9 @@ static void compact_node(int nid)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> cc.zone = zone;
>
> compact_zone(&cc, NULL);
> @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
> if (!populated_zone(zone))
> continue;
>
> + if (!zone_can_frag(zone))
> + continue;
> +
> ret = compaction_suit_allocation_order(zone,
> pgdat->kcompactd_max_order,
> highest_zoneidx, ALLOC_WMARK_MIN);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c958f7ebb5..b57faa0a1e83 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> }
>
> /* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int *pextra_pins)
> +static bool can_split_folio(struct folio *folio, int *pextra_pins)
> {
> int extra_pins;
>
> + if (!folio_can_split(folio))
> + return false;
> +
> /* Additional pins from page cache */
> if (folio_test_anon(folio))
> extra_pins = folio_test_swapcache(folio) ?
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..1f84dd759086 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma)
>
> bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> {
> - enum zone_type dynamic_policy_zone = policy_zone;
> -
> - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
> + WARN_ON_ONCE(zid_is_virt(policy_zone));
>
> /*
> - * if policy->nodes has movable memory only,
> - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
> + * If policy->nodes has memory in virtual zones only, we apply policy
> + * only if gfp_zone(gfp) can allocate from those zones.
> *
> * policy->nodes is intersect with node_states[N_MEMORY].
> * so if the following test fails, it implies
> - * policy->nodes has movable memory only.
> + * policy->nodes has memory in virtual zones only.
> */
> if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
> - dynamic_policy_zone = ZONE_MOVABLE;
> + return zone > LAST_PHYS_ZONE;
>
> - return zone >= dynamic_policy_zone;
> + return zone >= policy_zone;
> }
>
> /* Do dynamic interleaving for a process */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cc9f2bcd73b4..f615c0c22046 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> {
> int rc;
>
> + if (!folio_can_split(folio))
> + return -EBUSY;
> +
> folio_lock(folio);
> rc = split_folio_to_list(folio, split_folios);
> folio_unlock(folio);
> @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private)
> order = folio_order(src);
> }
> zidx = zone_idx(folio_zone(src));
> - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> + if (zidx > ZONE_NORMAL)
> gfp_mask |= __GFP_HIGHMEM;
>
> return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
> @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
> break;
> }
> wakeup_kswapd(pgdat->node_zones + z, 0,
> - folio_order(folio), ZONE_MOVABLE);
> + folio_order(folio), z);
> return 0;
> }
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36..7769c21e6d54 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init);
>
> static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata;
> static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata;
> -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
>
> -static unsigned long required_kernelcore __initdata;
> -static unsigned long required_kernelcore_percent __initdata;
> -static unsigned long required_movablecore __initdata;
> -static unsigned long required_movablecore_percent __initdata;
> +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata;
> +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid])
> +
> +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE])
> +
> +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata;
> +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE])
> +
> +int zone_nosplit_order __read_mostly;
> +int zone_nomerge_order __read_mostly;
>
> static unsigned long nr_kernel_pages __initdata;
> static unsigned long nr_all_pages __initdata;
> @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p)
> return 0;
> }
>
> - return cmdline_parse_core(p, &required_kernelcore,
> - &required_kernelcore_percent);
> + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE),
> + &percentage_of(LAST_PHYS_ZONE));
> }
> early_param("kernelcore", cmdline_parse_kernelcore);
>
> @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore);
> */
> static int __init cmdline_parse_movablecore(char *p)
> {
> - return cmdline_parse_core(p, &required_movablecore,
> - &required_movablecore_percent);
> + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE),
> + &percentage_of(ZONE_MOVABLE));
> }
> early_param("movablecore", cmdline_parse_movablecore);
>
> +static int __init parse_zone_order(char *p, unsigned long *nr_pages,
> + unsigned long *percent, int *order)
> +{
> + int err;
> + unsigned long n;
> + char *s = strchr(p, ',');
> +
> + if (!s)
> + return -EINVAL;
> +
> + *s++ = '\0';
> +
> + err = kstrtoul(s, 0, &n);
> + if (err)
> + return err;
> +
> + if (n < 2 || n > MAX_PAGE_ORDER)
> + return -EINVAL;
> +
> + err = cmdline_parse_core(p, nr_pages, percent);
> + if (err)
> + return err;
> +
> + *order = n;
> +
> + return 0;
> +}
> +
> +static int __init parse_zone_nosplit(char *p)
> +{
> + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT),
> + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order);
> +}
> +early_param("nosplit", parse_zone_nosplit);
> +
> +static int __init parse_zone_nomerge(char *p)
> +{
> + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE),
> + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order);
> +}
> +early_param("nomerge", parse_zone_nomerge);
> +
> /*
> * early_calculate_totalpages()
> - * Sum pages in active regions for movable zone.
> + * Sum pages in active regions for virtual zones.
> * Populate N_MEMORY for calculating usable_nodes.
> */
> static unsigned long __init early_calculate_totalpages(void)
> @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void)
> }
>
> /*
> - * This finds a zone that can be used for ZONE_MOVABLE pages. The
> + * This finds a physical zone that can be used for virtual zones. The
> * assumption is made that zones within a node are ordered in monotonic
> * increasing memory addresses so that the "highest" populated zone is used
> */
> -static void __init find_usable_zone_for_movable(void)
> +static void __init find_usable_zone(void)
> {
> int zone_index;
> - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
> - if (zone_index == ZONE_MOVABLE)
> - continue;
> -
> + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) {
> if (arch_zone_highest_possible_pfn[zone_index] >
> arch_zone_lowest_possible_pfn[zone_index])
> break;
> }
>
> VM_BUG_ON(zone_index == -1);
> - movable_zone = zone_index;
> + virt_zone = zone_index;
> +}
> +
> +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn)
> +{
> + int i, nid;
> + unsigned long node_avg, remaining;
> + int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> + /* usable_startpfn is the lowest possible pfn virtual zones can be at */
> + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone];
> +
> +restart:
> + /* Carve out memory as evenly as possible throughout nodes */
> + node_avg = occupied / usable_nodes;
> + for_each_node_state(nid, N_MEMORY) {
> + unsigned long start_pfn, end_pfn;
> +
> + /*
> + * Recalculate node_avg if the division per node now exceeds
> + * what is necessary to satisfy the amount of memory to carve
> + * out.
> + */
> + if (occupied < node_avg)
> + node_avg = occupied / usable_nodes;
> +
> + /*
> + * As the map is walked, we track how much memory is usable
> + * using remaining. When it is 0, the rest of the node is
> + * usable.
> + */
> + remaining = node_avg;
> +
> + /* Go through each range of PFNs within this node */
> + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> + unsigned long size_pages;
> +
> + start_pfn = max(start_pfn, zone_pfn[nid]);
> + if (start_pfn >= end_pfn)
> + continue;
> +
> + /* Account for what is only usable when carving out */
> + if (start_pfn < usable_startpfn) {
> + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn;
> +
> + remaining -= min(nr_pages, remaining);
> + occupied -= min(nr_pages, occupied);
> +
> + /* Continue if range is now fully accounted */
> + if (end_pfn <= usable_startpfn) {
> +
> + /*
> + * Push zone_pfn to the end so that if
> + * we have to carve out more across
> + * nodes, we will not double account
> + * here.
> + */
> + zone_pfn[nid] = end_pfn;
> + continue;
> + }
> + start_pfn = usable_startpfn;
> + }
> +
> + /*
> + * The usable PFN range is from start_pfn->end_pfn.
> + * Calculate size_pages as the number of pages used.
> + */
> + size_pages = end_pfn - start_pfn;
> + if (size_pages > remaining)
> + size_pages = remaining;
> + zone_pfn[nid] = start_pfn + size_pages;
> +
> + /*
> + * Some memory was carved out, update counts and break
> + * if the request for this node has been satisfied.
> + */
> + occupied -= min(occupied, size_pages);
> + remaining -= size_pages;
> + if (!remaining)
> + break;
> + }
> + }
> +
> + /*
> + * If there is still more to carve out, we do another pass with one less
> + * node in the count. This will push zone_pfn[nid] further along on the
> + * nodes that still have memory until the request is fully satisfied.
> + */
> + usable_nodes--;
> + if (usable_nodes && occupied > usable_nodes)
> + goto restart;
> }
>
> /*
> @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void)
> * memory. When they don't, some nodes will have more kernelcore than
> * others
> */
> -static void __init find_zone_movable_pfns_for_nodes(void)
> +static void __init find_virt_zones(void)
> {
> - int i, nid;
> + int i;
> + int nid;
> unsigned long usable_startpfn;
> - unsigned long kernelcore_node, kernelcore_remaining;
> /* save the state before borrow the nodemask */
> nodemask_t saved_node_state = node_states[N_MEMORY];
> unsigned long totalpages = early_calculate_totalpages();
> - int usable_nodes = nodes_weight(node_states[N_MEMORY]);
> struct memblock_region *r;
> + unsigned long occupied = 0;
>
> - /* Need to find movable_zone earlier when movable_node is specified. */
> - find_usable_zone_for_movable();
> + /* Need to find virt_zone earlier when movable_node is specified. */
> + find_usable_zone();
>
> /*
> * If movable_node is specified, ignore kernelcore and movablecore
> @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> nid = memblock_get_region_node(r);
>
> usable_startpfn = PFN_DOWN(r->base);
> - zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> - min(usable_startpfn, zone_movable_pfn[nid]) :
> + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
> usable_startpfn;
> }
>
> @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> continue;
> }
>
> - zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> - min(usable_startpfn, zone_movable_pfn[nid]) :
> + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ?
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) :
> usable_startpfn;
> }
>
> @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void)
> goto out2;
> }
>
> + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) {
> + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0;
> + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0;
> + zone_nosplit_order = zone_nomerge_order = 0;
> + pr_warn("zone %s order %d must be higher zone %s order %d\n",
> + zone_names[ZONE_NOMERGE], zone_nomerge_order,
> + zone_names[ZONE_NOSPLIT], zone_nosplit_order);
> + }
> +
> /*
> * If kernelcore=nn% or movablecore=nn% was specified, calculate the
> * amount of necessary memory.
> */
> - if (required_kernelcore_percent)
> - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
> - 10000UL;
> - if (required_movablecore_percent)
> - required_movablecore = (totalpages * 100 * required_movablecore_percent) /
> - 10000UL;
> + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) {
> + if (percentage_of(i))
> + nr_pages_of(i) = totalpages * percentage_of(i) / 100;
> +
> + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES);
> + occupied += nr_pages_of(i);
> + }
>
> /*
> * If movablecore= was specified, calculate what size of
> * kernelcore that corresponds so that memory usable for
> * any allocation type is evenly spread. If both kernelcore
> * and movablecore are specified, then the value of kernelcore
> - * will be used for required_kernelcore if it's greater than
> - * what movablecore would have allowed.
> + * will be used if it's greater than what movablecore would have
> + * allowed.
> */
> - if (required_movablecore) {
> - unsigned long corepages;
> + if (occupied < totalpages) {
> + enum zone_type zid;
>
> - /*
> - * Round-up so that ZONE_MOVABLE is at least as large as what
> - * was requested by the user
> - */
> - required_movablecore =
> - roundup(required_movablecore, MAX_ORDER_NR_PAGES);
> - required_movablecore = min(totalpages, required_movablecore);
> - corepages = totalpages - required_movablecore;
> -
> - required_kernelcore = max(required_kernelcore, corepages);
> + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ?
> + LAST_PHYS_ZONE : ZONE_MOVABLE;
> + nr_pages_of(zid) += totalpages - occupied;
> }
>
> /*
> * If kernelcore was not specified or kernelcore size is larger
> - * than totalpages, there is no ZONE_MOVABLE.
> + * than totalpages, there are not virtual zones.
> */
> - if (!required_kernelcore || required_kernelcore >= totalpages)
> + occupied = nr_pages_of(LAST_PHYS_ZONE);
> + if (!occupied || occupied >= totalpages)
> goto out;
>
> - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
> - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
> + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> + if (!nr_pages_of(i))
> + continue;
>
> -restart:
> - /* Spread kernelcore memory as evenly as possible throughout nodes */
> - kernelcore_node = required_kernelcore / usable_nodes;
> - for_each_node_state(nid, N_MEMORY) {
> - unsigned long start_pfn, end_pfn;
> -
> - /*
> - * Recalculate kernelcore_node if the division per node
> - * now exceeds what is necessary to satisfy the requested
> - * amount of memory for the kernel
> - */
> - if (required_kernelcore < kernelcore_node)
> - kernelcore_node = required_kernelcore / usable_nodes;
> -
> - /*
> - * As the map is walked, we track how much memory is usable
> - * by the kernel using kernelcore_remaining. When it is
> - * 0, the rest of the node is usable by ZONE_MOVABLE
> - */
> - kernelcore_remaining = kernelcore_node;
> -
> - /* Go through each range of PFNs within this node */
> - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> - unsigned long size_pages;
> -
> - start_pfn = max(start_pfn, zone_movable_pfn[nid]);
> - if (start_pfn >= end_pfn)
> - continue;
> -
> - /* Account for what is only usable for kernelcore */
> - if (start_pfn < usable_startpfn) {
> - unsigned long kernel_pages;
> - kernel_pages = min(end_pfn, usable_startpfn)
> - - start_pfn;
> -
> - kernelcore_remaining -= min(kernel_pages,
> - kernelcore_remaining);
> - required_kernelcore -= min(kernel_pages,
> - required_kernelcore);
> -
> - /* Continue if range is now fully accounted */
> - if (end_pfn <= usable_startpfn) {
> -
> - /*
> - * Push zone_movable_pfn to the end so
> - * that if we have to rebalance
> - * kernelcore across nodes, we will
> - * not double account here
> - */
> - zone_movable_pfn[nid] = end_pfn;
> - continue;
> - }
> - start_pfn = usable_startpfn;
> - }
> -
> - /*
> - * The usable PFN range for ZONE_MOVABLE is from
> - * start_pfn->end_pfn. Calculate size_pages as the
> - * number of pages used as kernelcore
> - */
> - size_pages = end_pfn - start_pfn;
> - if (size_pages > kernelcore_remaining)
> - size_pages = kernelcore_remaining;
> - zone_movable_pfn[nid] = start_pfn + size_pages;
> -
> - /*
> - * Some kernelcore has been met, update counts and
> - * break if the kernelcore for this node has been
> - * satisfied
> - */
> - required_kernelcore -= min(required_kernelcore,
> - size_pages);
> - kernelcore_remaining -= size_pages;
> - if (!kernelcore_remaining)
> - break;
> - }
> + find_virt_zone(occupied, &pfn_of(i, 0));
> + occupied += nr_pages_of(i);
> }
> -
> - /*
> - * If there is still required_kernelcore, we do another pass with one
> - * less node in the count. This will push zone_movable_pfn[nid] further
> - * along on the nodes that still have memory until kernelcore is
> - * satisfied
> - */
> - usable_nodes--;
> - if (usable_nodes && required_kernelcore > usable_nodes)
> - goto restart;
> -
> out2:
> - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
> + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */
> for (nid = 0; nid < MAX_NUMNODES; nid++) {
> unsigned long start_pfn, end_pfn;
> -
> - zone_movable_pfn[nid] =
> - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
> + unsigned long prev_virt_zone_pfn = 0;
>
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> - if (zone_movable_pfn[nid] >= end_pfn)
> - zone_movable_pfn[nid] = 0;
> +
> + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) {
> + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES);
> +
> + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn)
> + pfn_of(i, nid) = 0;
> +
> + if (pfn_of(i, nid))
> + prev_virt_zone_pfn = pfn_of(i, nid);
> + }
> }
> -
> out:
> /* restore the node_state */
> node_states[N_MEMORY] = saved_node_state;
> @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone,
> #endif
>
> /*
> - * The zone ranges provided by the architecture do not include ZONE_MOVABLE
> - * because it is sized independent of architecture. Unlike the other zones,
> - * the starting point for ZONE_MOVABLE is not fixed. It may be different
> - * in each node depending on the size of each node and how evenly kernelcore
> - * is distributed. This helper function adjusts the zone ranges
> + * The zone ranges provided by the architecture do not include virtual zones
> + * because they are sized independent of architecture. Unlike physical zones,
> + * the starting point for the first populated virtual zone is not fixed. It may
> + * be different in each node depending on the size of each node and how evenly
> + * kernelcore is distributed. This helper function adjusts the zone ranges
> * provided by the architecture for a given node by using the end of the
> - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
> - * zones within a node are in order of monotonic increases memory addresses
> + * highest usable zone for the first populated virtual zone. This preserves the
> + * assumption that zones within a node are in order of monotonic increases
> + * memory addresses.
> */
> -static void __init adjust_zone_range_for_zone_movable(int nid,
> +static void __init adjust_zone_range(int nid,
> unsigned long zone_type,
> unsigned long node_end_pfn,
> unsigned long *zone_start_pfn,
> unsigned long *zone_end_pfn)
> {
> - /* Only adjust if ZONE_MOVABLE is on this node */
> - if (zone_movable_pfn[nid]) {
> - /* Size ZONE_MOVABLE */
> - if (zone_type == ZONE_MOVABLE) {
> - *zone_start_pfn = zone_movable_pfn[nid];
> - *zone_end_pfn = min(node_end_pfn,
> - arch_zone_highest_possible_pfn[movable_zone]);
> + int i = max_t(int, zone_type, LAST_PHYS_ZONE);
> + unsigned long next_virt_zone_pfn = 0;
>
> - /* Adjust for ZONE_MOVABLE starting within this range */
> - } else if (!mirrored_kernelcore &&
> - *zone_start_pfn < zone_movable_pfn[nid] &&
> - *zone_end_pfn > zone_movable_pfn[nid]) {
> - *zone_end_pfn = zone_movable_pfn[nid];
> + while (i++ < LAST_VIRT_ZONE) {
> + if (pfn_of(i, nid)) {
> + next_virt_zone_pfn = pfn_of(i, nid);
> + break;
> + }
> + }
>
> - /* Check if this whole range is within ZONE_MOVABLE */
> - } else if (*zone_start_pfn >= zone_movable_pfn[nid])
> + if (zone_type <= LAST_PHYS_ZONE) {
> + if (!next_virt_zone_pfn)
> + return;
> +
> + if (!mirrored_kernelcore &&
> + *zone_start_pfn < next_virt_zone_pfn &&
> + *zone_end_pfn > next_virt_zone_pfn)
> + *zone_end_pfn = next_virt_zone_pfn;
> + else if (*zone_start_pfn >= next_virt_zone_pfn)
> *zone_start_pfn = *zone_end_pfn;
> + } else if (zone_type <= LAST_VIRT_ZONE) {
> + if (!pfn_of(zone_type, nid))
> + return;
> +
> + if (next_virt_zone_pfn)
> + *zone_end_pfn = min3(next_virt_zone_pfn,
> + node_end_pfn,
> + arch_zone_highest_possible_pfn[virt_zone]);
> + else
> + *zone_end_pfn = min(node_end_pfn,
> + arch_zone_highest_possible_pfn[virt_zone]);
> + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid));
> }
> }
>
> @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
> * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
> * and vice versa.
> */
> - if (mirrored_kernelcore && zone_movable_pfn[nid]) {
> + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) {
> unsigned long start_pfn, end_pfn;
> struct memblock_region *r;
>
> @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
> /* Get the start and end of the zone */
> *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
> *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
> - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn,
> - zone_start_pfn, zone_end_pfn);
> + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn);
>
> /* Check that this node has pages within the zone's required range */
> if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
> @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> #if defined(CONFIG_MEMORY_HOTPLUG)
> zone->present_early_pages = real_size;
> #endif
> + if (i == ZONE_NOSPLIT)
> + zone->order = zone_nosplit_order;
> + if (i == ZONE_NOMERGE)
> + zone->order = zone_nomerge_order;
>
> totalpages += spanned;
> realtotalpages += real_size;
> @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat)
> {
> enum zone_type zone_type;
>
> - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
> + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) {
> struct zone *zone = &pgdat->node_zones[zone_type];
> if (populated_zone(zone)) {
> if (IS_ENABLED(CONFIG_HIGHMEM))
> @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void)
> void __init free_area_init(unsigned long *max_zone_pfn)
> {
> unsigned long start_pfn, end_pfn;
> - int i, nid, zone;
> + int i, j, nid, zone;
> bool descending;
>
> /* Record where the zone boundaries are */
> @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> start_pfn = PHYS_PFN(memblock_start_of_DRAM());
> descending = arch_has_descending_max_zone_pfns();
>
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> + for (i = 0; i <= LAST_PHYS_ZONE; i++) {
> if (descending)
> - zone = MAX_NR_ZONES - i - 1;
> + zone = LAST_PHYS_ZONE - i;
> else
> zone = i;
>
> - if (zone == ZONE_MOVABLE)
> - continue;
> -
> end_pfn = max(max_zone_pfn[zone], start_pfn);
> arch_zone_lowest_possible_pfn[zone] = start_pfn;
> arch_zone_highest_possible_pfn[zone] = end_pfn;
> @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> start_pfn = end_pfn;
> }
>
> - /* Find the PFNs that ZONE_MOVABLE begins at in each node */
> - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
> - find_zone_movable_pfns_for_nodes();
> + /* Find the PFNs that virtual zones begin at in each node */
> + find_virt_zones();
>
> /* Print out the zone ranges */
> pr_info("Zone ranges:\n");
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> - if (i == ZONE_MOVABLE)
> - continue;
> + for (i = 0; i <= LAST_PHYS_ZONE; i++) {
> pr_info(" %-8s ", zone_names[i]);
> if (arch_zone_lowest_possible_pfn[i] ==
> arch_zone_highest_possible_pfn[i])
> @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> << PAGE_SHIFT) - 1);
> }
>
> - /* Print out the PFNs ZONE_MOVABLE begins at in each node */
> - pr_info("Movable zone start for each node\n");
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - if (zone_movable_pfn[i])
> - pr_info(" Node %d: %#018Lx\n", i,
> - (u64)zone_movable_pfn[i] << PAGE_SHIFT);
> + /* Print out the PFNs virtual zones begin at in each node */
> + for (; i <= LAST_VIRT_ZONE; i++) {
> + pr_info("%s zone start for each node\n", zone_names[i]);
> + for (j = 0; j < MAX_NUMNODES; j++) {
> + if (pfn_of(i, j))
> + pr_info(" Node %d: %#018Lx\n",
> + j, (u64)pfn_of(i, j) << PAGE_SHIFT);
> + }
> }
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 150d4f23b010..6a4da8f8691c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = {
> "HighMem",
> #endif
> "Movable",
> + "NoSplit",
> + "NoMerge",
> #ifdef CONFIG_ZONE_DEVICE
> "Device",
> #endif
> @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1;
> static int watermark_boost_factor __read_mostly = 15000;
> static int watermark_scale_factor = 10;
>
> -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> -int movable_zone;
> -EXPORT_SYMBOL(movable_zone);
> +/* virt_zone is the "real" zone pages in virtual zones are taken from */
> +int virt_zone;
> +EXPORT_SYMBOL(virt_zone);
>
> #if MAX_NUMNODES > 1
> unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;
> @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> unsigned long higher_page_pfn;
> struct page *higher_page;
>
> - if (order >= MAX_PAGE_ORDER - 1)
> - return false;
> -
> higher_page_pfn = buddy_pfn & pfn;
> higher_page = page + (higher_page_pfn - pfn);
>
> @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> NULL) != NULL;
> }
>
> +static int zone_max_order(struct zone *zone)
> +{
> + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER;
> +}
> +
> /*
> * Freeing function for a buddy system allocator.
> *
> @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page,
> unsigned long combined_pfn;
> struct page *buddy;
> bool to_tail;
> + int max_order = zone_max_order(zone);
>
> VM_BUG_ON(!zone_is_initialized(zone));
> VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> - while (order < MAX_PAGE_ORDER) {
> + while (order < max_order) {
> if (compaction_capture(capc, page, order, migratetype)) {
> __mod_zone_freepage_state(zone, -(1 << order),
> migratetype);
> @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page,
> to_tail = true;
> else if (is_shuffle_order(order))
> to_tail = shuffle_pick_tail();
> + else if (order + 1 >= max_order)
> + to_tail = false;
> else
> to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
>
> @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page,
> int mt;
> int ret = 0;
>
> + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page);
> +
> if (split_pfn_offset == 0)
> return ret;
>
> @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> struct free_area *area;
> struct page *page;
>
> + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order));
> +
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
> area = &(zone->free_area[current_order]);
> @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> long min = mark;
> int o;
>
> + if (!zone_is_suitable(z, order))
> + return false;
> +
> /* free_pages may go negative - that's OK */
> free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>
> @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> {
> long free_pages;
>
> + if (!zone_is_suitable(z, order))
> + return false;
> +
> free_pages = zone_page_state(z, NR_FREE_PAGES);
>
> /*
> @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> struct page *page;
> unsigned long mark;
>
> + if (!zone_is_suitable(zone, order))
> + continue;
> +
> if (cpusets_enabled() &&
> (alloc_flags & ALLOC_CPUSET) &&
> !__cpuset_zone_allowed(zone, gfp_mask))
> @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
> struct zone *zone;
> unsigned long flags;
>
> - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
> + /* Calculate total number of pages below ZONE_HIGHMEM */
> for_each_zone(zone) {
> - if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
> + if (zone_idx(zone) <= ZONE_NORMAL)
> lowmem_pages += zone_managed_pages(zone);
> }
>
> @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
> spin_lock_irqsave(&zone->lock, flags);
> tmp = (u64)pages_min * zone_managed_pages(zone);
> do_div(tmp, lowmem_pages);
> - if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
> + if (zone_idx(zone) > ZONE_NORMAL) {
> /*
> * __GFP_HIGH and PF_MEMALLOC allocations usually don't
> - * need highmem and movable zones pages, so cap pages_min
> - * to a small value here.
> + * need pages from zones above ZONE_NORMAL, so cap
> + * pages_min to a small value here.
> *
> * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
> * deltas control async page reclaim, and so should
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index cd0ea3668253..8a6473543427 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> * pages then it should be reasonably safe to assume the rest
> * is movable.
> */
> - if (zone_idx(zone) == ZONE_MOVABLE)
> + if (zid_is_virt(zone_idx(zone)))
> continue;
>
> /*
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..ad0db0373b05 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> entry.val = 0;
>
> if (folio_test_large(folio)) {
> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
> + folio_test_pmd_mappable(folio))
> get_swap_pages(1, &entry, folio_nr_pages(folio));
> goto out;
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..ae061ec4866a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> goto keep_locked;
> if (folio_maybe_dma_pinned(folio))
> goto keep_locked;
> - if (folio_test_large(folio)) {
> - /* cannot split folio, skip it */
> - if (!can_split_folio(folio, NULL))
> - goto activate_locked;
> - /*
> - * Split folios without a PMD map right
> - * away. Chances are some or all of the
> - * tail pages can be freed without IO.
> - */
> - if (!folio_entire_mapcount(folio) &&
> - split_folio_to_list(folio,
> - folio_list))
> - goto activate_locked;
> - }
> + /*
> + * Split folios that are not fully map right
> + * away. Chances are some of the tail pages can
> + * be freed without IO.
> + */
> + if (folio_test_large(folio) &&
> + atomic_read(&folio->_nr_pages_mapped) < nr_pages)
> + split_folio_to_list(folio, folio_list);
> if (!add_to_swap(folio)) {
> if (!folio_test_large(folio))
> goto activate_locked_split;
> @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> orig_mask = sc->gfp_mask;
> if (buffer_heads_over_limit) {
> sc->gfp_mask |= __GFP_HIGHMEM;
> - sc->reclaim_idx = gfp_zone(sc->gfp_mask);
> + sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
> }
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> struct scan_control sc = {
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .gfp_mask = current_gfp_context(gfp_mask),
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_order_zone(gfp_mask, order),
> .order = order,
> .nodemask = nodemask,
> .priority = DEF_PRIORITY,
> @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> if (!cpuset_zone_allowed(zone, gfp_flags))
> return;
>
> + curr_idx = gfp_order_zone(gfp_flags, order);
> + if (highest_zoneidx > curr_idx)
> + highest_zoneidx = curr_idx;
> +
> pgdat = zone->zone_pgdat;
> curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
>
> @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> .may_swap = 1,
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_order_zone(gfp_mask, order),
> };
> unsigned long pflags;
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index db79935e4a54..adbd032e6a0f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>
> #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
> TEXT_FOR_HIGHMEM(xx) xx "_movable", \
> + xx "_nosplit", xx "_nomerge", \
> TEXT_FOR_DEVICE(xx)
>
> const char * const vmstat_text[] = {
> @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> "\n spanned %lu"
> "\n present %lu"
> "\n managed %lu"
> - "\n cma %lu",
> + "\n cma %lu"
> + "\n order %u",
> zone_page_state(zone, NR_FREE_PAGES),
> zone->watermark_boost,
> min_wmark_pages(zone),
> @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> zone->spanned_pages,
> zone->present_pages,
> zone_managed_pages(zone),
> - zone_cma_pages(zone));
> + zone_cma_pages(zone),
> + zone->order);
>
> seq_printf(m,
> "\n protection: (%ld",
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>