From: Yang Shi
Date: Thu, 29 Feb 2024 15:31:36 -0800
Subject: Re: [Chapter One] THP zones: the use cases of policy zones
To: Yu Zhao
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Jonathan Corbet
In-Reply-To: <20240229183436.4110845-2-yuzhao@google.com>
References: <20240229183436.4110845-1-yuzhao@google.com> <20240229183436.4110845-2-yuzhao@google.com>

On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
>    memory.
> 2. The device zone provides interoperability between CPU and device
>    memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
>    ZONE_MOVABLE) and restricted to a minimum order to be
>    anti-fragmentation. The latter means that they cannot be split down
>    below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
>    to an exact order. The latter means that not only is split
>    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
>    reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
>    higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
>    unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
>    contiguous PTEs on arm64 [1], which are more suitable for client
>    workloads.

I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios whose order is >= the min order of the
policy zones, the fallback order should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right? And an allocation whose order is <
the min order won't fall back to the policy zones even when all other
zones are depleted, and will fail instead, just like a non-movable
allocation can't fall back to ZONE_MOVABLE even though that zone still
has enough free memory, right?
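To make the question concrete, below is a minimal userspace sketch of
the order-based clamping done by the quoted gfp_order_zone() hunk,
assuming the reading above is correct. The zone enum, the helper name
highest_suitable_zone() and the boot-time orders (4 and 9) are
illustrative only; in the kernel the actual fallback is performed by
the zonelist walk in get_page_from_freelist(), with zone_is_suitable()
filtering each candidate zone.

/*
 * Minimal userspace model of the order-based clamping in the quoted
 * gfp_order_zone() hunk; zone indices and orders are illustrative.
 */
#include <stdio.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_NOSPLIT, ZONE_NOMERGE, NR_ZONES };

static const char *zone_names[NR_ZONES] = { "Normal", "Movable", "NoSplit", "NoMerge" };

static int zone_nosplit_order = 4;	/* hypothetical nosplit=...,4 */
static int zone_nomerge_order = 9;	/* hypothetical nomerge=...,9 */

/* Highest zone a (__GFP_MOVABLE | __GFP_COMP) allocation of @order may start from. */
static enum zone_type highest_suitable_zone(int order)
{
	int zid = ZONE_NOMERGE;	/* LAST_VIRT_ZONE for THP allocations */

	if (zid >= ZONE_NOMERGE && order != zone_nomerge_order)
		zid = ZONE_NOMERGE - 1;
	if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order)
		zid = ZONE_NOSPLIT - 1;

	return zid;
}

int main(void)
{
	int orders[] = { 0, 2, 4, 9 };

	for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		enum zone_type zid = highest_suitable_zone(orders[i]);

		/* The zonelist walk then falls back from zid toward ZONE_NORMAL. */
		printf("order %2d: fallback starts at %s\n", orders[i], zone_names[zid]);
	}

	return 0;
}

With these illustrative orders it prints Movable for orders 0 and 2,
NoSplit for order 4 and NoMerge for order 9, which matches the reading
that sub-minimum orders never see the policy zones.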
> > Policy zones can be dynamically resized by offlining pages in one of > them and onlining those pages in another of them. Note that this is > only done among policy zones, not between a policy zone and a physical > zone, since resizing is a (software) policy, not a physical > characteristic. > > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com= / > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao > --- > .../admin-guide/kernel-parameters.txt | 10 + > drivers/virtio/virtio_mem.c | 2 +- > include/linux/gfp.h | 24 +- > include/linux/huge_mm.h | 6 - > include/linux/mempolicy.h | 2 +- > include/linux/mmzone.h | 52 +- > include/linux/nodemask.h | 2 +- > include/linux/vm_event_item.h | 2 +- > include/trace/events/mmflags.h | 4 +- > mm/compaction.c | 12 + > mm/huge_memory.c | 5 +- > mm/mempolicy.c | 14 +- > mm/migrate.c | 7 +- > mm/mm_init.c | 452 ++++++++++-------- > mm/page_alloc.c | 44 +- > mm/page_isolation.c | 2 +- > mm/swap_slots.c | 3 +- > mm/vmscan.c | 32 +- > mm/vmstat.c | 7 +- > 19 files changed, 431 insertions(+), 251 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentat= ion/admin-guide/kernel-parameters.txt > index 31b3a25680d0..a6c181f6efde 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3529,6 +3529,16 @@ > allocations which rules out almost all kernel > allocations. Use with caution! > > + nosplit=3DX,Y [MM] Set the minimum order of the nosplit zone.= Pages in > + this zone can't be split down below order Y, whil= e free > + or in use. > + Like movablecore, X should be either nn[KMGTPE] o= r n%. > + > + nomerge=3DX,Y [MM] Set the exact orders of the nomerge zone. = Pages in > + this zone are always order Y, meaning they can't = be > + split or merged while free or in use. > + Like movablecore, X should be either nn[KMGTPE] o= r n%. > + > MTD_Partition=3D [MTD] > Format: ,,, > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 8e3223294442..37ecf5ee4afd 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct vir= tio_mem *vm, > page =3D pfn_to_online_page(pfn); > if (!page) > continue; > - if (page_zonenum(page) !=3D ZONE_MOVABLE) > + if (!is_zone_movable_page(page)) > return false; > } > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t = gfp_flags) > * GFP_ZONES_SHIFT must be <=3D 2 on 32 bit platforms. 
> */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <=3D 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <=3D 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z =3D (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) =3D=3D (__GFP_MOVABLE = | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + > return z; > } > > +extern int zone_nomerge_order __read_mostly; > +extern int zone_nosplit_order __read_mostly; > + > +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) > +{ > + enum zone_type zid =3D gfp_zone(flags); > + > + if (zid >=3D ZONE_NOMERGE && order !=3D zone_nomerge_order) > + zid =3D ZONE_NOMERGE - 1; > + > + if (zid >=3D ZONE_NOSPLIT && order < zone_nosplit_order) > + zid =3D ZONE_NOSPLIT - 1; > + > + return zid; > +} > + > /* > * There is only one page-allocator function, and two main namespaces to > * it. The alloc_page*() variants return 'struct page *' and as such > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..9960ad7c3b10 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp= , unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long fla= gs); > > void folio_prep_large_rmappable(struct folio *folio); > -bool can_split_folio(struct folio *folio, int *pextra_pins); > int split_huge_page_to_list(struct page *page, struct list_head *list); > static inline int split_huge_page(struct page *page) > { > @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct= folio *folio) {} > > #define thp_get_unmapped_area NULL > > -static inline bool > -can_split_folio(struct folio *folio, int *pextra_pins) > -{ > - return false; > -} > static inline int > split_huge_page_to_list(struct page *page, struct list_head *list) > { > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..a92bcf47cf8c 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; > > static inline void check_highest_zone(enum zone_type k) > { > - if (k > policy_zone && k !=3D ZONE_MOVABLE) > + if (k > policy_zone && !zid_is_virt(k)) > policy_zone =3D k; > } > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index a497f189d988..532218167bba 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -805,11 +805,15 @@ enum zone_type { > * there can be false negatives). 
> */ > ZONE_MOVABLE, > + ZONE_NOSPLIT, > + ZONE_NOMERGE, > #ifdef CONFIG_ZONE_DEVICE > ZONE_DEVICE, > #endif > - __MAX_NR_ZONES > + __MAX_NR_ZONES, > > + LAST_PHYS_ZONE =3D ZONE_MOVABLE - 1, > + LAST_VIRT_ZONE =3D ZONE_NOMERGE, > }; > > #ifndef __GENERATING_BOUNDS_H > @@ -929,6 +933,8 @@ struct zone { > seqlock_t span_seqlock; > #endif > > + int order; > + > int initialized; > > /* Write-intensive fields used from the page allocator */ > @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const str= uct folio *folio) > > static inline bool is_zone_movable_page(const struct page *page) > { > - return page_zonenum(page) =3D=3D ZONE_MOVABLE; > + return page_zonenum(page) >=3D ZONE_MOVABLE; > } > > static inline bool folio_is_zone_movable(const struct folio *folio) > { > - return folio_zonenum(folio) =3D=3D ZONE_MOVABLE; > + return folio_zonenum(folio) >=3D ZONE_MOVABLE; > +} > + > +static inline bool page_can_split(struct page *page) > +{ > + return page_zonenum(page) < ZONE_NOSPLIT; > +} > + > +static inline bool folio_can_split(struct folio *folio) > +{ > + return folio_zonenum(folio) < ZONE_NOSPLIT; > } > #endif > > @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) {= return node_id; }; > */ > #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > > +static inline bool zid_is_virt(enum zone_type zid) > +{ > + return zid > LAST_PHYS_ZONE && zid <=3D LAST_VIRT_ZONE; > +} > + > +static inline bool zone_can_frag(struct zone *zone) > +{ > + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); > + > + return zone_idx(zone) < ZONE_NOSPLIT; > +} > + > +static inline bool zone_is_suitable(struct zone *zone, int order) > +{ > + int zid =3D zone_idx(zone); > + > + if (zid < ZONE_NOSPLIT) > + return true; > + > + if (!zone->order) > + return false; > + > + return (zid =3D=3D ZONE_NOSPLIT && order >=3D zone->order) || > + (zid =3D=3D ZONE_NOMERGE && order =3D=3D zone->order); > +} > + > #ifdef CONFIG_ZONE_DEVICE > static inline bool zone_is_zone_device(struct zone *zone) > { > @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) > static inline void zone_set_nid(struct zone *zone, int nid) {} > #endif > > -extern int movable_zone; > +extern int virt_zone; > > static inline int is_highmem_idx(enum zone_type idx) > { > #ifdef CONFIG_HIGHMEM > return (idx =3D=3D ZONE_HIGHMEM || > - (idx =3D=3D ZONE_MOVABLE && movable_zone =3D=3D ZONE_HIGH= MEM)); > + (zid_is_virt(idx) && virt_zone =3D=3D ZONE_HIGHMEM)); > #else > return 0; > #endif > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index b61438313a73..34fbe910576d 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -404,7 +404,7 @@ enum node_states { > #else > N_HIGH_MEMORY =3D N_NORMAL_MEMORY, > #endif > - N_MEMORY, /* The node has memory(regular, high, mov= able) */ > + N_MEMORY, /* The node has memory in any of the zone= s */ > N_CPU, /* The node has one or more cpus */ > N_GENERIC_INITIATOR, /* The node has one or more Generic Initi= ators */ > NR_NODE_STATES > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.= h > index 747943bc8cc2..9a54d15d5ec3 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -27,7 +27,7 @@ > #endif > > #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ > - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) > + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE= _ZONE(xx) > > enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, 
PSWPOUT, > FOR_ALL_ZONES(PGALLOC) > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflag= s.h > index d801409b33cf..2b5fdafaadea 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" )= \ > IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ > EM (ZONE_NORMAL, "Normal") \ > IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ > - EMe(ZONE_MOVABLE,"Movable") > + EM (ZONE_MOVABLE,"Movable") \ > + EM (ZONE_NOSPLIT,"NoSplit") \ > + EMe(ZONE_NOMERGE,"NoMerge") > > #define LRU_NAMES \ > EM (LRU_INACTIVE_ANON, "inactive_anon") \ > diff --git a/mm/compaction.c b/mm/compaction.c > index 4add68d40e8d..8a64c805f411 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_= mask, unsigned int order, > ac->highest_zoneidx, ac->nodemask= ) { > enum compact_result status; > > + if (!zone_can_frag(zone)) > + continue; > + > if (prio > MIN_COMPACT_PRIORITY > && compaction_deferred(zone, orde= r)) { > rc =3D max_t(enum compact_result, COMPACT_DEFERRE= D, rc); > @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat= ) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone =3D zone; > > compact_zone(&cc, NULL); > @@ -2846,6 +2852,9 @@ static void compact_node(int nid) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone =3D zone; > > compact_zone(&cc, NULL); > @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgda= t) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > ret =3D compaction_suit_allocation_order(zone, > pgdat->kcompactd_max_order, > highest_zoneidx, ALLOC_WMARK_MIN); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 94c958f7ebb5..b57faa0a1e83 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, = struct list_head *list, > } > > /* Racy check whether the huge page can be split */ > -bool can_split_folio(struct folio *folio, int *pextra_pins) > +static bool can_split_folio(struct folio *folio, int *pextra_pins) > { > int extra_pins; > > + if (!folio_can_split(folio)) > + return false; > + > /* Additional pins from page cache */ > if (folio_test_anon(folio)) > extra_pins =3D folio_test_swapcache(folio) ? > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 10a590ee1c89..1f84dd759086 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) > > bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > - enum zone_type dynamic_policy_zone =3D policy_zone; > - > - BUG_ON(dynamic_policy_zone =3D=3D ZONE_MOVABLE); > + WARN_ON_ONCE(zid_is_virt(policy_zone)); > > /* > - * if policy->nodes has movable memory only, > - * we apply policy when gfp_zone(gfp) =3D ZONE_MOVABLE only. > + * If policy->nodes has memory in virtual zones only, we apply po= licy > + * only if gfp_zone(gfp) can allocate from those zones. > * > * policy->nodes is intersect with node_states[N_MEMORY]. > * so if the following test fails, it implies > - * policy->nodes has movable memory only. > + * policy->nodes has memory in virtual zones only. 
> */ > if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) > - dynamic_policy_zone =3D ZONE_MOVABLE; > + return zone > LAST_PHYS_ZONE; > > - return zone >=3D dynamic_policy_zone; > + return zone >=3D policy_zone; > } > > /* Do dynamic interleaving for a process */ > diff --git a/mm/migrate.c b/mm/migrate.c > index cc9f2bcd73b4..f615c0c22046 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *fol= io, struct list_head *split_f > { > int rc; > > + if (!folio_can_split(folio)) > + return -EBUSY; > + > folio_lock(folio); > rc =3D split_folio_to_list(folio, split_folios); > folio_unlock(folio); > @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *= src, unsigned long private) > order =3D folio_order(src); > } > zidx =3D zone_idx(folio_zone(src)); > - if (is_highmem_idx(zidx) || zidx =3D=3D ZONE_MOVABLE) > + if (zidx > ZONE_NORMAL) > gfp_mask |=3D __GFP_HIGHMEM; > > return __folio_alloc(gfp_mask, order, nid, mtc->nmask); > @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgd= at, struct folio *folio) > break; > } > wakeup_kswapd(pgdat->node_zones + z, 0, > - folio_order(folio), ZONE_MOVABLE); > + folio_order(folio), z); > return 0; > } > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 2c19f5515e36..7769c21e6d54 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); > > static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initd= ata; > static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __init= data; > -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; > > -static unsigned long required_kernelcore __initdata; > -static unsigned long required_kernelcore_percent __initdata; > -static unsigned long required_movablecore __initdata; > -static unsigned long required_movablecore_percent __initdata; > +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUM= NODES] __initdata; > +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) > + > +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] = __initdata; > +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) > + > +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1= ] __initdata; > +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) > + > +int zone_nosplit_order __read_mostly; > +int zone_nomerge_order __read_mostly; > > static unsigned long nr_kernel_pages __initdata; > static unsigned long nr_all_pages __initdata; > @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) > return 0; > } > > - return cmdline_parse_core(p, &required_kernelcore, > - &required_kernelcore_percent); > + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), > + &percentage_of(LAST_PHYS_ZONE)); > } > early_param("kernelcore", cmdline_parse_kernelcore); > > @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore)= ; > */ > static int __init cmdline_parse_movablecore(char *p) > { > - return cmdline_parse_core(p, &required_movablecore, > - &required_movablecore_percent); > + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), > + &percentage_of(ZONE_MOVABLE)); > } > early_param("movablecore", cmdline_parse_movablecore); > > +static int __init parse_zone_order(char *p, unsigned long *nr_pages, > + unsigned long *percent, int *order) > +{ > + int err; > + unsigned long n; > + char *s =3D strchr(p, ','); > + > + if 
(!s) > + return -EINVAL; > + > + *s++ =3D '\0'; > + > + err =3D kstrtoul(s, 0, &n); > + if (err) > + return err; > + > + if (n < 2 || n > MAX_PAGE_ORDER) > + return -EINVAL; > + > + err =3D cmdline_parse_core(p, nr_pages, percent); > + if (err) > + return err; > + > + *order =3D n; > + > + return 0; > +} > + > +static int __init parse_zone_nosplit(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), > + &percentage_of(ZONE_NOSPLIT), &zone_nospl= it_order); > +} > +early_param("nosplit", parse_zone_nosplit); > + > +static int __init parse_zone_nomerge(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), > + &percentage_of(ZONE_NOMERGE), &zone_nomer= ge_order); > +} > +early_param("nomerge", parse_zone_nomerge); > + > /* > * early_calculate_totalpages() > - * Sum pages in active regions for movable zone. > + * Sum pages in active regions for virtual zones. > * Populate N_MEMORY for calculating usable_nodes. > */ > static unsigned long __init early_calculate_totalpages(void) > @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalp= ages(void) > } > > /* > - * This finds a zone that can be used for ZONE_MOVABLE pages. The > + * This finds a physical zone that can be used for virtual zones. The > * assumption is made that zones within a node are ordered in monotonic > * increasing memory addresses so that the "highest" populated zone is u= sed > */ > -static void __init find_usable_zone_for_movable(void) > +static void __init find_usable_zone(void) > { > int zone_index; > - for (zone_index =3D MAX_NR_ZONES - 1; zone_index >=3D 0; zone_ind= ex--) { > - if (zone_index =3D=3D ZONE_MOVABLE) > - continue; > - > + for (zone_index =3D LAST_PHYS_ZONE; zone_index >=3D 0; zone_index= --) { > if (arch_zone_highest_possible_pfn[zone_index] > > arch_zone_lowest_possible_pfn[zone_index]= ) > break; > } > > VM_BUG_ON(zone_index =3D=3D -1); > - movable_zone =3D zone_index; > + virt_zone =3D zone_index; > +} > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long = *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; > + int usable_nodes =3D nodes_weight(node_states[N_MEMORY]); > + /* usable_startpfn is the lowest possible pfn virtual zones can b= e at */ > + unsigned long usable_startpfn =3D arch_zone_lowest_possible_pfn[v= irt_zone]; > + > +restart: > + /* Carve out memory as evenly as possible throughout nodes */ > + node_avg =3D occupied / usable_nodes; > + for_each_node_state(nid, N_MEMORY) { > + unsigned long start_pfn, end_pfn; > + > + /* > + * Recalculate node_avg if the division per node now exce= eds > + * what is necessary to satisfy the amount of memory to c= arve > + * out. > + */ > + if (occupied < node_avg) > + node_avg =3D occupied / usable_nodes; > + > + /* > + * As the map is walked, we track how much memory is usab= le > + * using remaining. When it is 0, the rest of the node is > + * usable. 
> + */ > + remaining =3D node_avg; > + > + /* Go through each range of PFNs within this node */ > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL= ) { > + unsigned long size_pages; > + > + start_pfn =3D max(start_pfn, zone_pfn[nid]); > + if (start_pfn >=3D end_pfn) > + continue; > + > + /* Account for what is only usable when carving o= ut */ > + if (start_pfn < usable_startpfn) { > + unsigned long nr_pages =3D min(end_pfn, u= sable_startpfn) - start_pfn; > + > + remaining -=3D min(nr_pages, remaining); > + occupied -=3D min(nr_pages, occupied); > + > + /* Continue if range is now fully account= ed */ > + if (end_pfn <=3D usable_startpfn) { > + > + /* > + * Push zone_pfn to the end so th= at if > + * we have to carve out more acro= ss > + * nodes, we will not double acco= unt > + * here. > + */ > + zone_pfn[nid] =3D end_pfn; > + continue; > + } > + start_pfn =3D usable_startpfn; > + } > + > + /* > + * The usable PFN range is from start_pfn->end_pf= n. > + * Calculate size_pages as the number of pages us= ed. > + */ > + size_pages =3D end_pfn - start_pfn; > + if (size_pages > remaining) > + size_pages =3D remaining; > + zone_pfn[nid] =3D start_pfn + size_pages; > + > + /* > + * Some memory was carved out, update counts and = break > + * if the request for this node has been satisfie= d. > + */ > + occupied -=3D min(occupied, size_pages); > + remaining -=3D size_pages; > + if (!remaining) > + break; > + } > + } > + > + /* > + * If there is still more to carve out, we do another pass with o= ne less > + * node in the count. This will push zone_pfn[nid] further along = on the > + * nodes that still have memory until the request is fully satisf= ied. > + */ > + usable_nodes--; > + if (usable_nodes && occupied > usable_nodes) > + goto restart; > } > > /* > @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(voi= d) > * memory. When they don't, some nodes will have more kernelcore than > * others > */ > -static void __init find_zone_movable_pfns_for_nodes(void) > +static void __init find_virt_zones(void) > { > - int i, nid; > + int i; > + int nid; > unsigned long usable_startpfn; > - unsigned long kernelcore_node, kernelcore_remaining; > /* save the state before borrow the nodemask */ > nodemask_t saved_node_state =3D node_states[N_MEMORY]; > unsigned long totalpages =3D early_calculate_totalpages(); > - int usable_nodes =3D nodes_weight(node_states[N_MEMORY]); > struct memblock_region *r; > + unsigned long occupied =3D 0; > > - /* Need to find movable_zone earlier when movable_node is specifi= ed. */ > - find_usable_zone_for_movable(); > + /* Need to find virt_zone earlier when movable_node is specified.= */ > + find_usable_zone(); > > /* > * If movable_node is specified, ignore kernelcore and movablecor= e > @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(v= oid) > nid =3D memblock_get_region_node(r); > > usable_startpfn =3D PFN_DOWN(r->base); > - zone_movable_pfn[nid] =3D zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid= ]) : > + pfn_of(ZONE_MOVABLE, nid) =3D pfn_of(ZONE_MOVABLE= , nid) ? > + min(usable_startpfn, pfn_of(ZONE_MOVABLE,= nid)) : > usable_startpfn; > } > > @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(v= oid) > continue; > } > > - zone_movable_pfn[nid] =3D zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid= ]) : > + pfn_of(ZONE_MOVABLE, nid) =3D pfn_of(ZONE_MOVABLE= , nid) ? 
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE,= nid)) : > usable_startpfn; > } > > @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_node= s(void) > goto out2; > } > > + if (zone_nomerge_order && zone_nomerge_order <=3D zone_nosplit_or= der) { > + nr_pages_of(ZONE_NOSPLIT) =3D nr_pages_of(ZONE_NOMERGE) = =3D 0; > + percentage_of(ZONE_NOSPLIT) =3D percentage_of(ZONE_NOMERG= E) =3D 0; > + zone_nosplit_order =3D zone_nomerge_order =3D 0; > + pr_warn("zone %s order %d must be higher zone %s order %d= \n", > + zone_names[ZONE_NOMERGE], zone_nomerge_order, > + zone_names[ZONE_NOSPLIT], zone_nosplit_order); > + } > + > /* > * If kernelcore=3Dnn% or movablecore=3Dnn% was specified, calcul= ate the > * amount of necessary memory. > */ > - if (required_kernelcore_percent) > - required_kernelcore =3D (totalpages * 100 * required_kern= elcore_percent) / > - 10000UL; > - if (required_movablecore_percent) > - required_movablecore =3D (totalpages * 100 * required_mov= ablecore_percent) / > - 10000UL; > + for (i =3D LAST_PHYS_ZONE; i <=3D LAST_VIRT_ZONE; i++) { > + if (percentage_of(i)) > + nr_pages_of(i) =3D totalpages * percentage_of(i) = / 100; > + > + nr_pages_of(i) =3D roundup(nr_pages_of(i), MAX_ORDER_NR_P= AGES); > + occupied +=3D nr_pages_of(i); > + } > > /* > * If movablecore=3D was specified, calculate what size of > * kernelcore that corresponds so that memory usable for > * any allocation type is evenly spread. If both kernelcore > * and movablecore are specified, then the value of kernelcore > - * will be used for required_kernelcore if it's greater than > - * what movablecore would have allowed. > + * will be used if it's greater than what movablecore would have > + * allowed. > */ > - if (required_movablecore) { > - unsigned long corepages; > + if (occupied < totalpages) { > + enum zone_type zid; > > - /* > - * Round-up so that ZONE_MOVABLE is at least as large as = what > - * was requested by the user > - */ > - required_movablecore =3D > - roundup(required_movablecore, MAX_ORDER_NR_PAGES)= ; > - required_movablecore =3D min(totalpages, required_movable= core); > - corepages =3D totalpages - required_movablecore; > - > - required_kernelcore =3D max(required_kernelcore, corepage= s); > + zid =3D !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_= MOVABLE) ? > + LAST_PHYS_ZONE : ZONE_MOVABLE; > + nr_pages_of(zid) +=3D totalpages - occupied; > } > > /* > * If kernelcore was not specified or kernelcore size is larger > - * than totalpages, there is no ZONE_MOVABLE. > + * than totalpages, there are not virtual zones. 
> */ > - if (!required_kernelcore || required_kernelcore >=3D totalpages) > + occupied =3D nr_pages_of(LAST_PHYS_ZONE); > + if (!occupied || occupied >=3D totalpages) > goto out; > > - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be= at */ > - usable_startpfn =3D arch_zone_lowest_possible_pfn[movable_zone]; > + for (i =3D LAST_PHYS_ZONE + 1; i <=3D LAST_VIRT_ZONE; i++) { > + if (!nr_pages_of(i)) > + continue; > > -restart: > - /* Spread kernelcore memory as evenly as possible throughout node= s */ > - kernelcore_node =3D required_kernelcore / usable_nodes; > - for_each_node_state(nid, N_MEMORY) { > - unsigned long start_pfn, end_pfn; > - > - /* > - * Recalculate kernelcore_node if the division per node > - * now exceeds what is necessary to satisfy the requested > - * amount of memory for the kernel > - */ > - if (required_kernelcore < kernelcore_node) > - kernelcore_node =3D required_kernelcore / usable_= nodes; > - > - /* > - * As the map is walked, we track how much memory is usab= le > - * by the kernel using kernelcore_remaining. When it is > - * 0, the rest of the node is usable by ZONE_MOVABLE > - */ > - kernelcore_remaining =3D kernelcore_node; > - > - /* Go through each range of PFNs within this node */ > - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL= ) { > - unsigned long size_pages; > - > - start_pfn =3D max(start_pfn, zone_movable_pfn[nid= ]); > - if (start_pfn >=3D end_pfn) > - continue; > - > - /* Account for what is only usable for kernelcore= */ > - if (start_pfn < usable_startpfn) { > - unsigned long kernel_pages; > - kernel_pages =3D min(end_pfn, usable_star= tpfn) > - - start_p= fn; > - > - kernelcore_remaining -=3D min(kernel_page= s, > - kernelcore_remain= ing); > - required_kernelcore -=3D min(kernel_pages= , > - required_kernelco= re); > - > - /* Continue if range is now fully account= ed */ > - if (end_pfn <=3D usable_startpfn) { > - > - /* > - * Push zone_movable_pfn to the e= nd so > - * that if we have to rebalance > - * kernelcore across nodes, we wi= ll > - * not double account here > - */ > - zone_movable_pfn[nid] =3D end_pfn= ; > - continue; > - } > - start_pfn =3D usable_startpfn; > - } > - > - /* > - * The usable PFN range for ZONE_MOVABLE is from > - * start_pfn->end_pfn. Calculate size_pages as th= e > - * number of pages used as kernelcore > - */ > - size_pages =3D end_pfn - start_pfn; > - if (size_pages > kernelcore_remaining) > - size_pages =3D kernelcore_remaining; > - zone_movable_pfn[nid] =3D start_pfn + size_pages; > - > - /* > - * Some kernelcore has been met, update counts an= d > - * break if the kernelcore for this node has been > - * satisfied > - */ > - required_kernelcore -=3D min(required_kernelcore, > - size_page= s); > - kernelcore_remaining -=3D size_pages; > - if (!kernelcore_remaining) > - break; > - } > + find_virt_zone(occupied, &pfn_of(i, 0)); > + occupied +=3D nr_pages_of(i); > } > - > - /* > - * If there is still required_kernelcore, we do another pass with= one > - * less node in the count. 
This will push zone_movable_pfn[nid] f= urther > - * along on the nodes that still have memory until kernelcore is > - * satisfied > - */ > - usable_nodes--; > - if (usable_nodes && required_kernelcore > usable_nodes) > - goto restart; > - > out2: > - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES = */ > + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGE= S */ > for (nid =3D 0; nid < MAX_NUMNODES; nid++) { > unsigned long start_pfn, end_pfn; > - > - zone_movable_pfn[nid] =3D > - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES= ); > + unsigned long prev_virt_zone_pfn =3D 0; > > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > - if (zone_movable_pfn[nid] >=3D end_pfn) > - zone_movable_pfn[nid] =3D 0; > + > + for (i =3D LAST_PHYS_ZONE + 1; i <=3D LAST_VIRT_ZONE; i++= ) { > + pfn_of(i, nid) =3D roundup(pfn_of(i, nid), MAX_OR= DER_NR_PAGES); > + > + if (pfn_of(i, nid) <=3D prev_virt_zone_pfn || pfn= _of(i, nid) >=3D end_pfn) > + pfn_of(i, nid) =3D 0; > + > + if (pfn_of(i, nid)) > + prev_virt_zone_pfn =3D pfn_of(i, nid); > + } > } > - > out: > /* restore the node_state */ > node_states[N_MEMORY] =3D saved_node_state; > @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *z= one, > #endif > > /* > - * The zone ranges provided by the architecture do not include ZONE_MOVA= BLE > - * because it is sized independent of architecture. Unlike the other zon= es, > - * the starting point for ZONE_MOVABLE is not fixed. It may be different > - * in each node depending on the size of each node and how evenly kernel= core > - * is distributed. This helper function adjusts the zone ranges > + * The zone ranges provided by the architecture do not include virtual z= ones > + * because they are sized independent of architecture. Unlike physical z= ones, > + * the starting point for the first populated virtual zone is not fixed.= It may > + * be different in each node depending on the size of each node and how = evenly > + * kernelcore is distributed. This helper function adjusts the zone rang= es > * provided by the architecture for a given node by using the end of the > - * highest usable zone for ZONE_MOVABLE. This preserves the assumption t= hat > - * zones within a node are in order of monotonic increases memory addres= ses > + * highest usable zone for the first populated virtual zone. This preser= ves the > + * assumption that zones within a node are in order of monotonic increas= es > + * memory addresses. 
> */ > -static void __init adjust_zone_range_for_zone_movable(int nid, > +static void __init adjust_zone_range(int nid, > unsigned long zone_type, > unsigned long node_end_pfn, > unsigned long *zone_start_pfn, > unsigned long *zone_end_pfn) > { > - /* Only adjust if ZONE_MOVABLE is on this node */ > - if (zone_movable_pfn[nid]) { > - /* Size ZONE_MOVABLE */ > - if (zone_type =3D=3D ZONE_MOVABLE) { > - *zone_start_pfn =3D zone_movable_pfn[nid]; > - *zone_end_pfn =3D min(node_end_pfn, > - arch_zone_highest_possible_pfn[movable_zo= ne]); > + int i =3D max_t(int, zone_type, LAST_PHYS_ZONE); > + unsigned long next_virt_zone_pfn =3D 0; > > - /* Adjust for ZONE_MOVABLE starting within this range */ > - } else if (!mirrored_kernelcore && > - *zone_start_pfn < zone_movable_pfn[nid] && > - *zone_end_pfn > zone_movable_pfn[nid]) { > - *zone_end_pfn =3D zone_movable_pfn[nid]; > + while (i++ < LAST_VIRT_ZONE) { > + if (pfn_of(i, nid)) { > + next_virt_zone_pfn =3D pfn_of(i, nid); > + break; > + } > + } > > - /* Check if this whole range is within ZONE_MOVABLE */ > - } else if (*zone_start_pfn >=3D zone_movable_pfn[nid]) > + if (zone_type <=3D LAST_PHYS_ZONE) { > + if (!next_virt_zone_pfn) > + return; > + > + if (!mirrored_kernelcore && > + *zone_start_pfn < next_virt_zone_pfn && > + *zone_end_pfn > next_virt_zone_pfn) > + *zone_end_pfn =3D next_virt_zone_pfn; > + else if (*zone_start_pfn >=3D next_virt_zone_pfn) > *zone_start_pfn =3D *zone_end_pfn; > + } else if (zone_type <=3D LAST_VIRT_ZONE) { > + if (!pfn_of(zone_type, nid)) > + return; > + > + if (next_virt_zone_pfn) > + *zone_end_pfn =3D min3(next_virt_zone_pfn, > + node_end_pfn, > + arch_zone_highest_possible_p= fn[virt_zone]); > + else > + *zone_end_pfn =3D min(node_end_pfn, > + arch_zone_highest_possible_pf= n[virt_zone]); > + *zone_start_pfn =3D min(*zone_end_pfn, pfn_of(zone_type, = nid)); > } > } > > @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_no= de(int nid, > * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages > * and vice versa. 
> */ > - if (mirrored_kernelcore && zone_movable_pfn[nid]) { > + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { > unsigned long start_pfn, end_pfn; > struct memblock_region *r; > > @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_n= ode(int nid, > /* Get the start and end of the zone */ > *zone_start_pfn =3D clamp(node_start_pfn, zone_low, zone_high); > *zone_end_pfn =3D clamp(node_end_pfn, zone_low, zone_high); > - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, > - zone_start_pfn, zone_end_pfn); > + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, z= one_end_pfn); > > /* Check that this node has pages within the zone's required rang= e */ > if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_= pfn) > @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struc= t pglist_data *pgdat, > #if defined(CONFIG_MEMORY_HOTPLUG) > zone->present_early_pages =3D real_size; > #endif > + if (i =3D=3D ZONE_NOSPLIT) > + zone->order =3D zone_nosplit_order; > + if (i =3D=3D ZONE_NOMERGE) > + zone->order =3D zone_nomerge_order; > > totalpages +=3D spanned; > realtotalpages +=3D real_size; > @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgda= t) > { > enum zone_type zone_type; > > - for (zone_type =3D 0; zone_type <=3D ZONE_MOVABLE - 1; zone_type+= +) { > + for (zone_type =3D 0; zone_type <=3D LAST_PHYS_ZONE; zone_type++)= { > struct zone *zone =3D &pgdat->node_zones[zone_type]; > if (populated_zone(zone)) { > if (IS_ENABLED(CONFIG_HIGHMEM)) > @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) > void __init free_area_init(unsigned long *max_zone_pfn) > { > unsigned long start_pfn, end_pfn; > - int i, nid, zone; > + int i, j, nid, zone; > bool descending; > > /* Record where the zone boundaries are */ > @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > start_pfn =3D PHYS_PFN(memblock_start_of_DRAM()); > descending =3D arch_has_descending_max_zone_pfns(); > > - for (i =3D 0; i < MAX_NR_ZONES; i++) { > + for (i =3D 0; i <=3D LAST_PHYS_ZONE; i++) { > if (descending) > - zone =3D MAX_NR_ZONES - i - 1; > + zone =3D LAST_PHYS_ZONE - i; > else > zone =3D i; > > - if (zone =3D=3D ZONE_MOVABLE) > - continue; > - > end_pfn =3D max(max_zone_pfn[zone], start_pfn); > arch_zone_lowest_possible_pfn[zone] =3D start_pfn; > arch_zone_highest_possible_pfn[zone] =3D end_pfn; > @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > start_pfn =3D end_pfn; > } > > - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ > - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); > - find_zone_movable_pfns_for_nodes(); > + /* Find the PFNs that virtual zones begin at in each node */ > + find_virt_zones(); > > /* Print out the zone ranges */ > pr_info("Zone ranges:\n"); > - for (i =3D 0; i < MAX_NR_ZONES; i++) { > - if (i =3D=3D ZONE_MOVABLE) > - continue; > + for (i =3D 0; i <=3D LAST_PHYS_ZONE; i++) { > pr_info(" %-8s ", zone_names[i]); > if (arch_zone_lowest_possible_pfn[i] =3D=3D > arch_zone_highest_possible_pfn[i]) > @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zon= e_pfn) > << PAGE_SHIFT) - 1); > } > > - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ > - pr_info("Movable zone start for each node\n"); > - for (i =3D 0; i < MAX_NUMNODES; i++) { > - if (zone_movable_pfn[i]) > - pr_info(" Node %d: %#018Lx\n", i, > - (u64)zone_movable_pfn[i] << PAGE_SHIFT); > + /* Print out the PFNs virtual 
zones begin at in each node */ > + for (; i <=3D LAST_VIRT_ZONE; i++) { > + pr_info("%s zone start for each node\n", zone_names[i]); > + for (j =3D 0; j < MAX_NUMNODES; j++) { > + if (pfn_of(i, j)) > + pr_info(" Node %d: %#018Lx\n", > + j, (u64)pfn_of(i, j) << PAGE_SHIF= T); > + } > } > > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 150d4f23b010..6a4da8f8691c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] =3D { > "HighMem", > #endif > "Movable", > + "NoSplit", > + "NoMerge", > #ifdef CONFIG_ZONE_DEVICE > "Device", > #endif > @@ -290,9 +292,9 @@ int user_min_free_kbytes =3D -1; > static int watermark_boost_factor __read_mostly =3D 15000; > static int watermark_scale_factor =3D 10; > > -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from = */ > -int movable_zone; > -EXPORT_SYMBOL(movable_zone); > +/* virt_zone is the "real" zone pages in virtual zones are taken from */ > +int virt_zone; > +EXPORT_SYMBOL(virt_zone); > > #if MAX_NUMNODES > 1 > unsigned int nr_node_ids __read_mostly =3D MAX_NUMNODES; > @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long b= uddy_pfn, > unsigned long higher_page_pfn; > struct page *higher_page; > > - if (order >=3D MAX_PAGE_ORDER - 1) > - return false; > - > higher_page_pfn =3D buddy_pfn & pfn; > higher_page =3D page + (higher_page_pfn - pfn); > > @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long = buddy_pfn, > NULL) !=3D NULL; > } > > +static int zone_max_order(struct zone *zone) > +{ > + return zone->order && zone_idx(zone) =3D=3D ZONE_NOMERGE ? zone->= order : MAX_PAGE_ORDER; > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + int max_order =3D zone_max_order(zone); > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > - while (order < MAX_PAGE_ORDER) { > + while (order < max_order) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > migratety= pe); > @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, > to_tail =3D true; > else if (is_shuffle_order(order)) > to_tail =3D shuffle_pick_tail(); > + else if (order + 1 >=3D max_order) > + to_tail =3D false; > else > to_tail =3D buddy_merge_likely(pfn, buddy_pfn, page, orde= r); > > @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, > int mt; > int ret =3D 0; > > + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); > + > if (split_pfn_offset =3D=3D 0) > return ret; > > @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, = unsigned int order, > struct free_area *area; > struct page *page; > > + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); > + > /* Find a page of the appropriate size in the preferred list */ > for (current_order =3D order; current_order < NR_PAGE_ORDERS; ++c= urrent_order) { > area =3D &(zone->free_area[current_order]); > @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned i= nt order, unsigned long mark, > long min =3D mark; > int o; > > + if (!zone_is_suitable(z, order)) > + return false; > + > /* free_pages may go 
negative - that's OK */ > free_pages -=3D __zone_watermark_unusable_free(z, order, alloc_fl= ags); > > @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone = *z, unsigned int order, > { > long free_pages; > > + if (!zone_is_suitable(z, order)) > + return false; > + > free_pages =3D zone_page_state(z, NR_FREE_PAGES); > > /* > @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int= order, int alloc_flags, > struct page *page; > unsigned long mark; > > + if (!zone_is_suitable(zone, order)) > + continue; > + > if (cpusets_enabled() && > (alloc_flags & ALLOC_CPUSET) && > !__cpuset_zone_allowed(zone, gfp_mask)) > @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void) > struct zone *zone; > unsigned long flags; > > - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE page= s */ > + /* Calculate total number of pages below ZONE_HIGHMEM */ > for_each_zone(zone) { > - if (!is_highmem(zone) && zone_idx(zone) !=3D ZONE_MOVABLE= ) > + if (zone_idx(zone) <=3D ZONE_NORMAL) > lowmem_pages +=3D zone_managed_pages(zone); > } > > @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void) > spin_lock_irqsave(&zone->lock, flags); > tmp =3D (u64)pages_min * zone_managed_pages(zone); > do_div(tmp, lowmem_pages); > - if (is_highmem(zone) || zone_idx(zone) =3D=3D ZONE_MOVABL= E) { > + if (zone_idx(zone) > ZONE_NORMAL) { > /* > * __GFP_HIGH and PF_MEMALLOC allocations usually= don't > - * need highmem and movable zones pages, so cap p= ages_min > - * to a small value here. > + * need pages from zones above ZONE_NORMAL, so ca= p > + * pages_min to a small value here. > * > * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_= MIN) > * deltas control async page reclaim, and so shou= ld > diff --git a/mm/page_isolation.c b/mm/page_isolation.c > index cd0ea3668253..8a6473543427 100644 > --- a/mm/page_isolation.c > +++ b/mm/page_isolation.c > @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long s= tart_pfn, unsigned long e > * pages then it should be reasonably safe to assume the = rest > * is movable. > */ > - if (zone_idx(zone) =3D=3D ZONE_MOVABLE) > + if (zid_is_virt(zone_idx(zone))) > continue; > > /* > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 0bec1f705f8e..ad0db0373b05 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > entry.val =3D 0; > > if (folio_test_large(folio)) { > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported= ()) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported= () && > + folio_test_pmd_mappable(folio)) > get_swap_pages(1, &entry, folio_nr_pages(folio)); > goto out; > } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4f9c854ce6cc..ae061ec4866a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list= _head *folio_list, > goto keep_locked; > if (folio_maybe_dma_pinned(folio)) > goto keep_locked; > - if (folio_test_large(folio)) { > - /* cannot split folio, skip it */ > - if (!can_split_folio(folio, NULL)= ) > - goto activate_locked; > - /* > - * Split folios without a PMD map= right > - * away. Chances are some or all = of the > - * tail pages can be freed withou= t IO. > - */ > - if (!folio_entire_mapcount(folio)= && > - split_folio_to_list(folio, > - folio_lis= t)) > - goto activate_locked; > - } > + /* > + * Split folios that are not fully map ri= ght > + * away. Chances are some of the tail pag= es can > + * be freed without IO. 
> + */ > + if (folio_test_large(folio) && > + atomic_read(&folio->_nr_pages_mapped)= < nr_pages) > + split_folio_to_list(folio, folio_= list); > if (!add_to_swap(folio)) { > if (!folio_test_large(folio)) > goto activate_locked_spli= t; > @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist,= struct scan_control *sc) > orig_mask =3D sc->gfp_mask; > if (buffer_heads_over_limit) { > sc->gfp_mask |=3D __GFP_HIGHMEM; > - sc->reclaim_idx =3D gfp_zone(sc->gfp_mask); > + sc->reclaim_idx =3D gfp_order_zone(sc->gfp_mask, sc->orde= r); > } > > for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zo= nelist, int order, > struct scan_control sc =3D { > .nr_to_reclaim =3D SWAP_CLUSTER_MAX, > .gfp_mask =3D current_gfp_context(gfp_mask), > - .reclaim_idx =3D gfp_zone(gfp_mask), > + .reclaim_idx =3D gfp_order_zone(gfp_mask, order), > .order =3D order, > .nodemask =3D nodemask, > .priority =3D DEF_PRIORITY, > @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_fl= ags, int order, > if (!cpuset_zone_allowed(zone, gfp_flags)) > return; > > + curr_idx =3D gfp_order_zone(gfp_flags, order); > + if (highest_zoneidx > curr_idx) > + highest_zoneidx =3D curr_idx; > + > pgdat =3D zone->zone_pgdat; > curr_idx =3D READ_ONCE(pgdat->kswapd_highest_zoneidx); > > @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat= , gfp_t gfp_mask, unsigned in > .may_writepage =3D !!(node_reclaim_mode & RECLAIM_WRITE), > .may_unmap =3D !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap =3D 1, > - .reclaim_idx =3D gfp_zone(gfp_mask), > + .reclaim_idx =3D gfp_order_zone(gfp_mask, order), > }; > unsigned long pflags; > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index db79935e4a54..adbd032e6a0f 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned= int order) > > #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_nor= mal", \ > TEXT_FOR_HIGHMEM(xx) xx "_movable= ", \ > + xx "_nosplit", xx "_nomerge", \ > TEXT_FOR_DEVICE(xx) > > const char * const vmstat_text[] =3D { > @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m,= pg_data_t *pgdat, > "\n spanned %lu" > "\n present %lu" > "\n managed %lu" > - "\n cma %lu", > + "\n cma %lu" > + "\n order %u", > zone_page_state(zone, NR_FREE_PAGES), > zone->watermark_boost, > min_wmark_pages(zone), > @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m,= pg_data_t *pgdat, > zone->spanned_pages, > zone->present_pages, > zone_managed_pages(zone), > - zone_cma_pages(zone)); > + zone_cma_pages(zone), > + zone->order); > > seq_printf(m, > "\n protection: (%ld", > -- > 2.44.0.rc1.240.g4c46232300-goog > >