linux-mm.kvack.org archive mirror
* [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
@ 2025-05-21 21:57 Juan Yescas
  2025-05-28  8:21 ` Vlastimil Babka
  2025-06-03 13:03 ` David Hildenbrand
  0 siblings, 2 replies; 8+ messages in thread
From: Juan Yescas @ 2025-05-21 21:57 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Juan Yescas, Zi Yan, linux-mm,
	linux-kernel
  Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.
This means that the system will have fewer available MIGRATE_UNMOVABLE
and MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back
to them.

The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
   4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
  16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
   4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
  16KiB   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver on a
4KiB kernel needs 4MiB of CMA memory, then on a 16KiB kernel the
minimum reservation has to be 32MiB due to the alignment requirements:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};

Solution: Add a new config option, CONFIG_PAGE_BLOCK_ORDER, that
allows setting the page block order on all architectures. The
maximum page block order is given by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER has the same value as
ARCH_FORCE_MAX_ORDER, so current kernel configurations are not
affected by this change. It is an opt-in change.

This patch allows large page size kernels (16KiB, 64KiB) to have
the same CMA alignment requirements as 4KiB kernels by setting a
lower pageblock_order.

Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
on 4k and 16k kernels.

- Verified that Transparent Huge Pages work when pageblock_order
is 1, 7, 10 on 4k and 16k kernels.

- Verified that dma-buf heaps allocations work when pageblock_order
is 1, 7, 10 on 4k and 16k kernels.

Benchmarks:

The benchmarks compare 16KiB kernels with pageblock_order 10 and 7. The
reason for pageblock_order 7 is that this value makes the minimum
CMA alignment requirement the same as on 4KiB kernels (2MiB).

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
(https://developer.android.com/ndk/guides/simpleperf) to measure
the # of instructions and page-faults on 16k kernels.
The benchmark was executed 10 times. The averages are below:

           # instructions         |     #page-faults
    order 10     |  order 7       | order 10 | order 7
--------------------------------------------------------
 13,891,765,770	 | 11,425,777,314 |    220   |   217
 14,456,293,487	 | 12,660,819,302 |    224   |   219
 13,924,261,018	 | 13,243,970,736 |    217   |   221
 13,910,886,504	 | 13,845,519,630 |    217   |   221
 14,388,071,190	 | 13,498,583,098 |    223   |   224
 13,656,442,167	 | 12,915,831,681 |    216   |   218
 13,300,268,343	 | 12,930,484,776 |    222   |   218
 13,625,470,223	 | 14,234,092,777 |    219   |   218
 13,508,964,965	 | 13,432,689,094 |    225   |   219
 13,368,950,667	 | 13,683,587,37  |    219   |   225
-------------------------------------------------------------------
 13,803,137,433  | 13,131,974,268 |    220   |   220    Averages

There were 4.86% fewer instructions when the order was 7, in
comparison with order 10:

     13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)

The average number of page faults was the same for orders 7 and 10.

These results didn't show any significant regression when the
pageblock_order is set to 7 on 16KiB kernels.

- Run Speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
 on the 16KiB kernels with pageblock_order 7 and 10.

order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
  15.8	 |  16.4    |         0.6        |     3.80%
  16.4	 |  16.2    |        -0.2        |    -1.22%
  16.6	 |  16.3    |        -0.3        |    -1.81%
  16.8	 |  16.3    |        -0.5        |    -2.98%
  16.6	 |  16.8    |         0.2        |     1.20%
-------------------------------------------------------------------
  16.44     16.4            -0.04	          -0.24%   Averages

The results didn't show any significant regression when the
pageblock_order is set to 7 on 16KiB kernels.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---
Changes in v7:
  - Update alignment calculation to 2MiB as per David's
    observation.
  - Update page block order calculation in mm/mm_init.c for
    powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.

Changes in v6:
  - Applied the change provided by Zi Yan to fix
    the Kconfig. The change consists in evaluating
    to true or false in the if expression for range:
    range 1 <symbol> if <expression to eval true/false>.

Changes in v5:
  - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
    ranges with config definitions don't work in Kconfig,
    for example (range 1 MY_CONFIG).
  - Add PAGE_BLOCK_ORDER_MANUAL config for the
    page block order number. The default value was not
    defined.
  - Fix typos reported by Andrew.
  - Test default configs in powerpc. 

Changes in v4:
  - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
    validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
    compile time.
  - This change fixes the warning in:
    https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/

Changes in v3:
  - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
    as per Matthew's suggestion.
  - Update comments in pageblock-flags.h for pageblock_order
    value when THP or HugeTLB are not used.

Changes in v2:
  - Add Zi's Acked-by tag.
  - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
    per Zi and Matthew suggestion so it is available to
    all the architectures.
  - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
    ARCH_FORCE_MAX_ORDER is not available.

 include/linux/mmzone.h          | 16 ++++++++++++++++
 include/linux/pageblock-flags.h |  8 ++++----
 mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
 mm/mm_init.c                    |  2 +-
 4 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6ccec1bf2896..05610337bbb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,6 +37,22 @@
 
 #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
 
+/* Defines the order for the number of pages that have a migrate type. */
+#ifndef CONFIG_PAGE_BLOCK_ORDER
+#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#else
+#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
+#endif /* CONFIG_PAGE_BLOCK_ORDER */
+
+/*
+ * MAX_PAGE_ORDER, which defines the max order of pages to be allocated
+ * by the buddy allocator, has to be larger than or equal to PAGE_BLOCK_ORDER,
+ * which defines the order for the number of pages that can have a migrate type.
+ */
+#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#endif
+
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fc6b9c87cb0a..e73a4292ef02 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
  * Huge pages are a constant size, but don't exceed the maximum allocation
  * granularity.
  */
-#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
 #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
 
-#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order		MAX_PAGE_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
+#define pageblock_order		PAGE_BLOCK_ORDER
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..13a5c4f6e6b6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -989,6 +989,40 @@ config CMA_AREAS
 
 	  If unsure, leave the default value "8" in UMA and "20" in NUMA.
 
+#
+# Select this config option from the architecture Kconfig, if available, to set
+# the max page order for physically contiguous allocations.
+#
+config ARCH_FORCE_MAX_ORDER
+	int
+
+#
+# When ARCH_FORCE_MAX_ORDER is not defined,
+# the default page block order is MAX_PAGE_ORDER (10) as per
+# include/linux/mmzone.h.
+#
+config PAGE_BLOCK_ORDER
+	int "Page Block Order"
+	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
+	default 10 if ARCH_FORCE_MAX_ORDER = 0
+	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+	default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+	help
+	  The page block order refers to the power of two number of pages that
+	  are physically contiguous and can have a migrate type associated to
+	  them. The maximum size of the page block order is limited by
+	  ARCH_FORCE_MAX_ORDER.
+
+	  This config allows overriding the default page block order when the
+	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
+	  or MAX_PAGE_ORDER.
+
+	  Reducing the pageblock order can negatively impact the THP
+	  allocation success rate. If your workload uses THP heavily,
+	  please use this option with caution.
+
+	  Don't change if unsure.
+
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
 	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 327764ca0ee4..ada5374764e4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1511,7 +1511,7 @@ static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order = MAX_PAGE_ORDER;
+	unsigned int order = PAGE_BLOCK_ORDER;
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
-- 
2.49.0.1143.g0be31eac6b-goog




* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
@ 2025-05-28  8:21 ` Vlastimil Babka
  2025-05-28 18:24   ` Andrew Morton
  2025-06-03 13:03 ` David Hildenbrand
  1 sibling, 1 reply; 8+ messages in thread
From: Vlastimil Babka @ 2025-05-28  8:21 UTC (permalink / raw)
  To: Juan Yescas, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, linux-mm, linux-kernel
  Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 5/21/25 23:57, Juan Yescas wrote:
[...]

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>




* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-28  8:21 ` Vlastimil Babka
@ 2025-05-28 18:24   ` Andrew Morton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2025-05-28 18:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Juan Yescas, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
	masahiroy, Minchan Kim

On Wed, 28 May 2025 10:21:46 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Great, thanks.  I'll move this patch into mm-stable.  I'll be sending
two MM pull requests to Linus this cycle.  The below patches will be in
the second batch, next week.



#
m68k-remove-use-of-page-index.patch
mm-rename-page-index-to-page-__folio_index.patch
#
ntfs3-use-folios-more-in-ntfs_compress_write.patch
iov-remove-copy_page_from_iter_atomic.patch
#
zram-rename-zcomp_param_no_level.patch
zram-support-deflate-specific-params.patch
#
selftests-mm-deduplicate-test-logging-in-test_mlock_lock.patch
#
selftests-mm-deduplicate-default-page-size-test-results-in-thuge-gen.patch
#
memcg-disable-kmem-charging-in-nmi-for-unsupported-arch.patch
memcg-nmi-safe-memcg-stats-for-specific-archs.patch
memcg-add-nmi-safe-update-for-memcg_kmem.patch
memcg-nmi-safe-slab-stats-updates.patch
memcg-make-memcg_rstat_updated-nmi-safe.patch
#
mm-damon-core-avoid-destroyed-target-reference-from-damos-quota.patch
#
mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
#
selftests-mm-skip-guard_regionsuffd-tests-when-uffd-is-not-present.patch
selftests-mm-skip-hugevm-test-if-kernel-config-file-is-not-present.patch
#
hugetlb-show-nr_huge_pages-in-report_hugepages.patch
#
#
mm-damon-kconfig-set-damon_vaddrpaddrsysfs-default-to-damon.patch
mm-damon-kconfig-enable-config_damon-by-default.patch
#
mmu_gather-move-tlb-flush-for-vm_pfnmap-vm_mixedmap-vmas-into-free_pgtables.patch
#
mm-rust-make-config_mmu-ifdefs-more-narrow.patch
#
kcov-rust-add-flags-for-kcov-with-rust.patch
#
#
selftests-mm-deduplicate-test-names-in-madv_populate.patch
#
mmu_notifiers-remove-leftover-stub-macros.patch
#
mm-add-config_page_block_order-to-select-page-block-order.patch
#




* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
  2025-05-28  8:21 ` Vlastimil Babka
@ 2025-06-03 13:03 ` David Hildenbrand
  2025-06-03 14:55   ` Zi Yan
  2025-06-03 15:20   ` Juan Yescas
  1 sibling, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-03 13:03 UTC (permalink / raw)
  To: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, linux-mm, linux-kernel
  Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 21.05.25 23:57, Juan Yescas wrote:
[...]
> +	  them. The maximum size of the page block order is limited by
> +	  ARCH_FORCE_MAX_ORDER.
> +
> +	  This config allows overriding the default page block order when the
> +	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
> +	  or MAX_PAGE_ORDER.
> +
> +	  Reducing pageblock order can negatively impact THP generation
> +	  success rate. If your workloads use THP heavily, please use this
> +	  option with caution.
> +
> +	  Don't change if unsure.


The semantics are now very confusing [1]. The default in x86-64 will be 
10, so we'll have

CONFIG_PAGE_BLOCK_ORDER=10


But then, we'll do this

#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, 
PAGE_BLOCK_ORDER)


So the actual pageblock order will be different than 
CONFIG_PAGE_BLOCK_ORDER.

Confusing.

Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL 
? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.

[1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-06-03 13:03 ` David Hildenbrand
@ 2025-06-03 14:55   ` Zi Yan
  2025-06-03 15:14     ` Zi Yan
  2025-06-03 15:20   ` Juan Yescas
  1 sibling, 1 reply; 8+ messages in thread
From: Zi Yan @ 2025-06-03 14:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
	masahiroy, Minchan Kim

On 3 Jun 2025, at 9:03, David Hildenbrand wrote:

> On 21.05.25 23:57, Juan Yescas wrote:
>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>> and this causes the CMA reservations to be larger than necessary.
>> This means that system will have less available MIGRATE_UNMOVABLE and
>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
>>
>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>
>> For example, in ARM, the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>> -----------------------------------------------------------------------
>>     4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
>>    16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
>>    64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>
>> There are some extreme cases for the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
>> - CONFIG_HUGETLB_PAGE is NOT set
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
>> ------------------------------------------------------------------------
>>     4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
>>    16KiB   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
>>    64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>
>> This affects the CMA reservations for the drivers. If a driver in a
>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>> reservation has to be 32MiB due to the alignment requirements:
>>
>> reserved-memory {
>>      ...
>>      cma_test_reserve: cma_test_reserve {
>>          compatible = "shared-dma-pool";
>>          size = <0x0 0x400000>; /* 4 MiB */
>>          ...
>>      };
>> };
>>
>> reserved-memory {
>>      ...
>>      cma_test_reserve: cma_test_reserve {
>>          compatible = "shared-dma-pool";
>>          size = <0x0 0x2000000>; /* 32 MiB */
>>          ...
>>      };
>> };
>>
>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>> allows setting the page block order on all architectures.
>> The maximum page block order will be given by
>> ARCH_FORCE_MAX_ORDER.
>>
>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>> value as ARCH_FORCE_MAX_ORDER. This makes sure that
>> current kernel configurations won't be affected by this
>> change. It is an opt-in change.
>>
>> This patch allows the same CMA alignment requirements for
>> large page sizes (16KiB, 64KiB) as in 4KiB kernels by
>> setting a lower pageblock_order.
>>
>> Tests:
>>
>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>> on 4k and 16k kernels.
>>
>> - Verified that Transparent Huge Pages work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>>
>> - Verified that dma-buf heaps allocations work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>>
>> Benchmarks:
>>
>> The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
>> pageblock_order 7 was chosen because it makes the min CMA alignment
>> requirement the same as in 4KiB kernels (2MiB).
>>
>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>> the # of instructions and page-faults on 16k kernels.
>> The benchmark was executed 10 times. The averages are below:
>>
>>             # instructions         |     #page-faults
>>      order 10     |  order 7       | order 10 | order 7
>> --------------------------------------------------------
>>   13,891,765,770	 | 11,425,777,314 |    220   |   217
>>   14,456,293,487	 | 12,660,819,302 |    224   |   219
>>   13,924,261,018	 | 13,243,970,736 |    217   |   221
>>   13,910,886,504	 | 13,845,519,630 |    217   |   221
>>   14,388,071,190	 | 13,498,583,098 |    223   |   224
>>   13,656,442,167	 | 12,915,831,681 |    216   |   218
>>   13,300,268,343	 | 12,930,484,776 |    222   |   218
>>   13,625,470,223	 | 14,234,092,777 |    219   |   218
>>   13,508,964,965	 | 13,432,689,094 |    225   |   219
>>   13,368,950,667	 | 13,683,587,37  |    219   |   225
>> -------------------------------------------------------------------
>>   13,803,137,433  | 13,131,974,268 |    220   |   220    Averages
>>
>> There were 4.86% fewer #instructions when order was 7, in comparison
>> with order 10.
>>
>>       13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
>>
>> The number of page faults in order 7 and 10 were the same.
>>
>> These results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16KiB kernels.
>>
>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>>   on the 16k kernels with pageblock_order 7 and 10.
>>
>> order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
>> -------------------------------------------------------------------
>>    15.8	 |  16.4    |         0.6        |     3.80%
>>    16.4	 |  16.2    |        -0.2        |    -1.22%
>>    16.6	 |  16.3    |        -0.3        |    -1.81%
>>    16.8	 |  16.3    |        -0.5        |    -2.98%
>>    16.6	 |  16.8    |         0.2        |     1.20%
>> -------------------------------------------------------------------
>>    16.44     16.4            -0.04	          -0.24%   Averages
>>
>> The results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16KiB kernels.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> CC: Mike Rapoport <rppt@kernel.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: Juan Yescas <jyescas@google.com>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> ---
>> Changes in v7:
>>    - Update alignment calculation to 2MiB as per David's
>>      observation.
>>    - Update page block order calculation in mm/mm_init.c for
>>      powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
>>
>> Changes in v6:
>>    - Applied the change provided by Zi Yan to fix
>>      the Kconfig. The change consists in evaluating
>>      to true or false in the if expression for range:
>>      range 1 <symbol> if <expression to eval true/false>.
>>
>> Changes in v5:
>>    - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
>>      ranges with config definitions don't work in Kconfig,
>>      for example (range 1 MY_CONFIG).
>>    - Add PAGE_BLOCK_ORDER_MANUAL config for the
>>      page block order number. The default value was not
>>      defined.
>>    - Fix typos reported by Andrew.
>>    - Test default configs in powerpc.
>>
>> Changes in v4:
>>    - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
>>      validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
>>      compile time.
>>    - This change fixes the warning in:
>>     https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
>>
>> Changes in v3:
>>    - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>>      as per Matthew's suggestion.
>>    - Update comments in pageblock-flags.h for pageblock_order
>>      value when THP or HugeTLB are not used.
>>
>> Changes in v2:
>>    - Add Zi's Acked-by tag.
>>    - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>>      per Zi and Matthew suggestion so it is available to
>>      all the architectures.
>>    - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>>      ARCH_FORCE_MAX_ORDER is not available.
>>
>>   include/linux/mmzone.h          | 16 ++++++++++++++++
>>   include/linux/pageblock-flags.h |  8 ++++----
>>   mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
>>   mm/mm_init.c                    |  2 +-
>>   4 files changed, 55 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 6ccec1bf2896..05610337bbb6 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -37,6 +37,22 @@
>>    #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
>>  +/* Defines the order for the number of pages that have a migrate type. */
>> +#ifndef CONFIG_PAGE_BLOCK_ORDER
>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>> +#else
>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>> +
>> +/*
>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
>> + * which defines the order for the number of pages that can have a migrate type
>> + */
>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
>> +#endif
>> +
>>   /*
>>    * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>>    * costly to service.  That is between allocation orders which should
>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>> index fc6b9c87cb0a..e73a4292ef02 100644
>> --- a/include/linux/pageblock-flags.h
>> +++ b/include/linux/pageblock-flags.h
>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>>    * Huge pages are a constant size, but don't exceed the maximum allocation
>>    * granularity.
>>    */
>> -#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>>    #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>    #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  -#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>    #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>> -#define pageblock_order		MAX_PAGE_ORDER
>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>> +#define pageblock_order		PAGE_BLOCK_ORDER
>>    #endif /* CONFIG_HUGETLB_PAGE */
>>  diff --git a/mm/Kconfig b/mm/Kconfig
>> index e113f713b493..13a5c4f6e6b6 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -989,6 +989,40 @@ config CMA_AREAS
>>    	  If unsure, leave the default value "8" in UMA and "20" in NUMA.
>>  +#
>> +# Select this config option from the architecture Kconfig, if available, to set
>> +# the max page order for physically contiguous allocations.
>> +#
>> +config ARCH_FORCE_MAX_ORDER
>> +	int
>> +
>> +#
>> +# When ARCH_FORCE_MAX_ORDER is not defined,
>> +# the default page block order is MAX_PAGE_ORDER (10) as per
>> +# include/linux/mmzone.h.
>> +#
>> +config PAGE_BLOCK_ORDER
>> +	int "Page Block Order"
>> +	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
>> +	default 10 if ARCH_FORCE_MAX_ORDER = 0
>> +	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>> +	default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>> +	help
>> +	  The page block order refers to the power of two number of pages that
>> +	  are physically contiguous and can have a migrate type associated to
>> +	  them. The maximum size of the page block order is limited by
>> +	  ARCH_FORCE_MAX_ORDER.
>> +
>> +	  This config allows overriding the default page block order when the
>> +	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
>> +	  or MAX_PAGE_ORDER.
>> +
>> +	  Reducing pageblock order can negatively impact THP generation
>> +	  success rate. If your workloads use THP heavily, please use this
>> +	  option with caution.
>> +
>> +	  Don't change if unsure.
>
>
> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>
> CONFIG_PAGE_BLOCK_ORDER=10
>
>
> But then, we'll do this
>
> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>
>
> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>
> Confusing.
>
> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.

IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.

>
> [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928
>


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-06-03 14:55   ` Zi Yan
@ 2025-06-03 15:14     ` Zi Yan
  2025-06-03 15:42       ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Zi Yan @ 2025-06-03 15:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
	masahiroy, Minchan Kim

On 3 Jun 2025, at 10:55, Zi Yan wrote:

> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>
>> On 21.05.25 23:57, Juan Yescas wrote:
>>> [...]
>>
>>
>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>
>> CONFIG_PAGE_BLOCK_ORDER=10
>>
>>
>> But then, we'll do this
>>
>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>
>>
>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>
>> Confusing.
>>
>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>
> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.

LIMIT might still be ambiguous, since it could be a lower or an upper limit.
CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I came up with;
if it looks good to you, I can send it out properly.

From 7fff4fd87ed3aa160db8d2f0d9e5b219321df4f9 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 3 Jun 2025 11:09:37 -0400
Subject: [PATCH] mm: rename CONFIG_PAGE_BLOCK_ORDER to
 CONFIG_PAGE_BLOCK_ORDER_CEIL.

The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/mmzone.h          | 14 +++++++-------
 include/linux/pageblock-flags.h |  8 ++++----
 mm/Kconfig                      | 15 ++++++++-------
 3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..523b407e63e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,19 +38,19 @@
 #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)

 /* Defines the order for the number of pages that have a migrate type. */
-#ifndef CONFIG_PAGE_BLOCK_ORDER
-#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#ifndef CONFIG_PAGE_BLOCK_ORDER_CEIL
+#define PAGE_BLOCK_ORDER_CEIL MAX_PAGE_ORDER
 #else
-#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
-#endif /* CONFIG_PAGE_BLOCK_ORDER */
+#define PAGE_BLOCK_ORDER_CEIL CONFIG_PAGE_BLOCK_ORDER_CEIL
+#endif /* CONFIG_PAGE_BLOCK_ORDER_CEIL */

 /*
  * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
- * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
+ * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER_CEIL,
  * which defines the order for the number of pages that can have a migrate type
  */
-#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
-#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#if (PAGE_BLOCK_ORDER_CEIL > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER_CEIL
 #endif

 /*
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e73a4292ef02..e7a86cd238c2 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
  * Huge pages are a constant size, but don't exceed the maximum allocation
  * granularity.
  */
-#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER_CEIL)

 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

 #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)

-#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL)

 #else /* CONFIG_TRANSPARENT_HUGEPAGE */

-/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
-#define pageblock_order		PAGE_BLOCK_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER_CEIL */
+#define pageblock_order		PAGE_BLOCK_ORDER_CEIL

 #endif /* CONFIG_HUGETLB_PAGE */

diff --git a/mm/Kconfig b/mm/Kconfig
index eccb2e46ffcb..3b27e644bd1f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1017,8 +1017,8 @@ config ARCH_FORCE_MAX_ORDER
 # the default page block order is MAX_PAGE_ORDER (10) as per
 # include/linux/mmzone.h.
 #
-config PAGE_BLOCK_ORDER
-	int "Page Block Order"
+config PAGE_BLOCK_ORDER_CEIL
+	int "Page Block Order Upper Limit"
 	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
 	default 10 if ARCH_FORCE_MAX_ORDER = 0
 	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
@@ -1026,12 +1026,13 @@ config PAGE_BLOCK_ORDER
 	help
 	  The page block order refers to the power of two number of pages that
 	  are physically contiguous and can have a migrate type associated to
-	  them. The maximum size of the page block order is limited by
-	  ARCH_FORCE_MAX_ORDER.
+	  them. The maximum size of the page block order is at least limited by
+	  ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER.

-	  This config allows overriding the default page block order when the
-	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
-	  or MAX_PAGE_ORDER.
+	  This config adds a new upper limit of default page block
+	  order when the page block order is required to be smaller than
+	  ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER or other limits
+	  (see include/linux/pageblock-flags.h for details).

 	  Reducing pageblock order can negatively impact THP generation
 	  success rate. If your workloads uses THP heavily, please use this
-- 
2.47.2



Best Regards,
Yan, Zi



* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-06-03 13:03 ` David Hildenbrand
  2025-06-03 14:55   ` Zi Yan
@ 2025-06-03 15:20   ` Juan Yescas
  1 sibling, 0 replies; 8+ messages in thread
From: Juan Yescas @ 2025-06-03 15:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
	linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
	masahiroy, Minchan Kim

On Tue, Jun 3, 2025 at 6:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.05.25 23:57, Juan Yescas wrote:
> > Problem: On large page size configurations (16KiB, 64KiB), the CMA
> > alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
> > and this causes the CMA reservations to be larger than necessary.
> > This means that system will have less available MIGRATE_UNMOVABLE and
> > MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
> >
> > The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
> > MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
> > ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
> >
> > For example, in ARM, the CMA alignment requirement when:
> >
> > - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
> > - CONFIG_TRANSPARENT_HUGEPAGE is set:
> >
> > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
> > -----------------------------------------------------------------------
> >     4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
> >    16Kib   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
> >    64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
> >
> > There are some extreme cases for the CMA alignment requirement when:
> >
> > - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
> > - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
> > - CONFIG_HUGETLB_PAGE is NOT set
> >
> > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
> > ------------------------------------------------------------------------
> >     4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
> >    16Kib   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
> >    64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
> >
> > This affects the CMA reservations for the drivers. If a driver in a
> > 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
> > reservation has to be 32MiB due to the alignment requirements:
> >
> > reserved-memory {
> >      ...
> >      cma_test_reserve: cma_test_reserve {
> >          compatible = "shared-dma-pool";
> >          size = <0x0 0x400000>; /* 4 MiB */
> >          ...
> >      };
> > };
> >
> > reserved-memory {
> >      ...
> >      cma_test_reserve: cma_test_reserve {
> >          compatible = "shared-dma-pool";
> >          size = <0x0 0x2000000>; /* 32 MiB */
> >          ...
> >      };
> > };
> >
> > Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
> > allows to set the page block order in all the architectures.
> > The maximum page block order will be given by
> > ARCH_FORCE_MAX_ORDER.
> >
> > By default, CONFIG_PAGE_BLOCK_ORDER will have the same
> > value that ARCH_FORCE_MAX_ORDER. This will make sure that
> > current kernel configurations won't be affected by this
> > change. It is a opt-in change.
> >
> > This patch will allow to have the same CMA alignment
> > requirements for large page sizes (16KiB, 64KiB) as that
> > in 4kb kernels by setting a lower pageblock_order.
> >
> > Tests:
> >
> > - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
> > on 4k and 16k kernels.
> >
> > - Verified that Transparent Huge Pages work when pageblock_order
> > is 1, 7, 10 on 4k and 16k kernels.
> >
> > - Verified that dma-buf heaps allocations work when pageblock_order
> > is 1, 7, 10 on 4k and 16k kernels.
> >
> > Benchmarks:
> >
> > The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The
> > reason for the pageblock_order 7 is because this value makes the min
> > CMA alignment requirement the same as that in 4kb kernels (2MB).
> >
> > - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
> > SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
> > (https://developer.android.com/ndk/guides/simpleperf) to measure
> > the # of instructions and page-faults on 16k kernels.
> > The benchmark was executed 10 times. The averages are below:
> >
> >             # instructions         |     #page-faults
> >      order 10     |  order 7       | order 10 | order 7
> > --------------------------------------------------------
> >   13,891,765,770       | 11,425,777,314 |    220   |   217
> >   14,456,293,487       | 12,660,819,302 |    224   |   219
> >   13,924,261,018       | 13,243,970,736 |    217   |   221
> >   13,910,886,504       | 13,845,519,630 |    217   |   221
> >   14,388,071,190       | 13,498,583,098 |    223   |   224
> >   13,656,442,167       | 12,915,831,681 |    216   |   218
> >   13,300,268,343       | 12,930,484,776 |    222   |   218
> >   13,625,470,223       | 14,234,092,777 |    219   |   218
> >   13,508,964,965       | 13,432,689,094 |    225   |   219
> >   13,368,950,667       | 13,683,587,37  |    219   |   225
> > -------------------------------------------------------------------
> >   13,803,137,433  | 13,131,974,268 |    220   |   220    Averages
> >
> > There were 4.85% #instructions when order was 7, in comparison
> > with order 10.
> >
> >       13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)
> >
> > The number of page faults in order 7 and 10 were the same.
> >
> > These results didn't show any significant regression when the
> > pageblock_order is set to 7 on 16kb kernels.
> >
> > - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
> >   on the 16k kernels with pageblock_order 7 and 10.
> >
> > order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
> > -------------------------------------------------------------------
> >    15.8        |  16.4    |         0.6        |     3.80%
> >    16.4        |  16.2    |        -0.2        |    -1.22%
> >    16.6        |  16.3    |        -0.3        |    -1.81%
> >    16.8        |  16.3    |        -0.5        |    -2.98%
> >    16.6        |  16.8    |         0.2        |     1.20%
> > -------------------------------------------------------------------
> >    16.44     16.4            -0.04              -0.24%   Averages
> >
> > The results didn't show any significant regression when the
> > pageblock_order is set to 7 on 16kb kernels.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > CC: Mike Rapoport <rppt@kernel.org>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Signed-off-by: Juan Yescas <jyescas@google.com>
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > ---
> > Changes in v7:
> >    - Update alignment calculation to 2MiB as per David's
> >      observation.
> >    - Update page block order calculation in mm/mm_init.c for
> >      powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
> >
> > Changes in v6:
> >    - Applied the change provided by Zi Yan to fix
> >      the Kconfig. The change consists in evaluating
> >      to true or false in the if expression for range:
> >      range 1 <symbol> if <expression to eval true/false>.
> >
> > Changes in v5:
> >    - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
> >      ranges with config definitions don't work in Kconfig,
> >      for example (range 1 MY_CONFIG).
> >    - Add PAGE_BLOCK_ORDER_MANUAL config for the
> >      page block order number. The default value was not
> >      defined.
> >    - Fix typos reported by Andrew.
> >    - Test default configs in powerpc.
> >
> > Changes in v4:
> >    - Set PAGE_BLOCK_ORDER in incluxe/linux/mmzone.h to
> >      validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
> >      compile time.
> >    - This change fixes the warning in:
> >      https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
> >
> > Changes in v3:
> >    - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
> >      as per Matthew's suggestion.
> >    - Update comments in pageblock-flags.h for pageblock_order
> >      value when THP or HugeTLB are not used.
> >
> > Changes in v2:
> >    - Add Zi's Acked-by tag.
> >    - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
> >      per Zi and Matthew suggestion so it is available to
> >      all the architectures.
> >    - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
> >      ARCH_FORCE_MAX_ORDER is not available.
> >
> >   include/linux/mmzone.h          | 16 ++++++++++++++++
> >   include/linux/pageblock-flags.h |  8 ++++----
> >   mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
> >   mm/mm_init.c                    |  2 +-
> >   4 files changed, 55 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 6ccec1bf2896..05610337bbb6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -37,6 +37,22 @@
> >
> >   #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
> >
> > +/* Defines the order for the number of pages that have a migrate type. */
> > +#ifndef CONFIG_PAGE_BLOCK_ORDER
> > +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
> > +#else
> > +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
> > +#endif /* CONFIG_PAGE_BLOCK_ORDER */
> > +
> > +/*
> > + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
> > + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
> > + * which defines the order for the number of pages that can have a migrate type
> > + */
> > +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
> > +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
> > +#endif
> > +
> >   /*
> >    * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> >    * costly to service.  That is between allocation orders which should
> > diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> > index fc6b9c87cb0a..e73a4292ef02 100644
> > --- a/include/linux/pageblock-flags.h
> > +++ b/include/linux/pageblock-flags.h
> > @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
> >    * Huge pages are a constant size, but don't exceed the maximum allocation
> >    * granularity.
> >    */
> > -#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
> >
> >   #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
> >
> >   #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >
> > -#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
> >
> >   #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> > -#define pageblock_order              MAX_PAGE_ORDER
> > +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
> > +#define pageblock_order              PAGE_BLOCK_ORDER
> >
> >   #endif /* CONFIG_HUGETLB_PAGE */
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index e113f713b493..13a5c4f6e6b6 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -989,6 +989,40 @@ config CMA_AREAS
> >
> >         If unsure, leave the default value "8" in UMA and "20" in NUMA.
> >
> > +#
> > +# Select this config option from the architecture Kconfig, if available, to set
> > +# the max page order for physically contiguous allocations.
> > +#
> > +config ARCH_FORCE_MAX_ORDER
> > +     int
> > +
> > +#
> > +# When ARCH_FORCE_MAX_ORDER is not defined,
> > +# the default page block order is MAX_PAGE_ORDER (10) as per
> > +# include/linux/mmzone.h.
> > +#
> > +config PAGE_BLOCK_ORDER
> > +     int "Page Block Order"
> > +     range 1 10 if ARCH_FORCE_MAX_ORDER = 0
> > +     default 10 if ARCH_FORCE_MAX_ORDER = 0
> > +     range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> > +     default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> > +     help
> > +       The page block order refers to the power of two number of pages that
> > +       are physically contiguous and can have a migrate type associated to
> > +       them. The maximum size of the page block order is limited by
> > +       ARCH_FORCE_MAX_ORDER.
> > +
> > +       This config allows overriding the default page block order when the
> > +       page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
> > +       or MAX_PAGE_ORDER.
> > +
> > +       Reducing pageblock order can negatively impact THP generation
> > +       success rate. If your workloads uses THP heavily, please use this
> > +       option with caution.
> > +
> > +       Don't change if unsure.
>
>
> The semantics are now very confusing [1]. The default in x86-64 will be
> 10, so we'll have
>
> CONFIG_PAGE_BLOCK_ORDER=10
>
>
> But then, we'll do this
>
> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER,
> PAGE_BLOCK_ORDER)
>
>
> So the actual pageblock order will be different than
> CONFIG_PAGE_BLOCK_ORDER.
>
> Confusing.

I agree that it becomes confusing, since the pageblock_order value
depends on whether THP or HugeTLB is enabled.
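To make the confusion concrete, here is a small sketch mirroring the MIN_T()
clamping in include/linux/pageblock-flags.h. The values are illustrative
(x86-64, 4KiB pages, THP enabled), not taken from a real build:

```python
# Sketch of the effective pageblock_order computation, mirroring the
# MIN_T(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) logic in pageblock-flags.h.
PAGE_SHIFT = 12                            # 4KiB base pages
PMD_SHIFT = 21                             # 2MiB PMD huge pages
HPAGE_PMD_ORDER = PMD_SHIFT - PAGE_SHIFT   # = 9
PAGE_BLOCK_ORDER_CEIL = 10                 # the Kconfig default on x86-64

# With THP enabled, the effective order is clamped by HPAGE_PMD_ORDER:
pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL)
print(pageblock_order)  # 9, not the configured 10 -- hence "ceiling", not "order"
```

So the config only caps the order; the effective value can be smaller.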

>
> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL
> ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>

We could rename the configuration to CONFIG_PAGE_BLOCK_ORDER_CEIL.
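For reference, the CMA alignment numbers in the cover letter follow directly
from the ceiling: CMA_MIN_ALIGNMENT_BYTES is PAGE_SIZE shifted by
pageblock_order. A quick sketch reproducing the tables above (illustrative
arithmetic only, not kernel code):

```python
# CMA_MIN_ALIGNMENT_BYTES == PAGE_SIZE << pageblock_order
def cma_min_alignment(page_size, pageblock_order):
    return page_size << pageblock_order

MiB = 1 << 20
assert cma_min_alignment(4 << 10, 9) == 2 * MiB      # 4KiB kernel, order 9
assert cma_min_alignment(16 << 10, 11) == 32 * MiB   # 16KiB kernel, default order
assert cma_min_alignment(64 << 10, 13) == 512 * MiB  # 64KiB kernel, default order
# Capping pageblock_order at 7 restores the 4KiB kernel's 2MiB requirement:
assert cma_min_alignment(16 << 10, 7) == 2 * MiB
```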

> [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928
>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-06-03 15:14     ` Zi Yan
@ 2025-06-03 15:42       ` David Hildenbrand
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-03 15:42 UTC (permalink / raw)
  To: Zi Yan
  Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
	masahiroy, Minchan Kim

On 03.06.25 17:14, Zi Yan wrote:
> On 3 Jun 2025, at 10:55, Zi Yan wrote:
> 
>> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>>
>>> On 21.05.25 23:57, Juan Yescas wrote:
>>>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>>>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>>>> and this causes the CMA reservations to be larger than necessary.
>>>> This means that system will have less available MIGRATE_UNMOVABLE and
>>>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
>>>>
>>>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>>>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>>>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>>>
>>>> For example, in ARM, the CMA alignment requirement when:
>>>>
>>>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>>>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>>>
>>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>>>> -----------------------------------------------------------------------
>>>>      4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
>>>>     16Kib   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
>>>>     64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>>>
>>>> There are some extreme cases for the CMA alignment requirement when:
>>>>
>>>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>>>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
>>>> - CONFIG_HUGETLB_PAGE is NOT set
>>>>
>>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
>>>> ------------------------------------------------------------------------
>>>>      4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
>>>>     16Kib   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
>>>>     64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
>>>>
>>>> This affects the CMA reservations for the drivers. If a driver in a
>>>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>>>> reservation has to be 32MiB due to the alignment requirements:
>>>>
>>>> reserved-memory {
>>>>       ...
>>>>       cma_test_reserve: cma_test_reserve {
>>>>           compatible = "shared-dma-pool";
>>>>           size = <0x0 0x400000>; /* 4 MiB */
>>>>           ...
>>>>       };
>>>> };
>>>>
>>>> reserved-memory {
>>>>       ...
>>>>       cma_test_reserve: cma_test_reserve {
>>>>           compatible = "shared-dma-pool";
>>>>           size = <0x0 0x2000000>; /* 32 MiB */
>>>>           ...
>>>>       };
>>>> };
>>>>
>>>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>>>> allows to set the page block order in all the architectures.
>>>> The maximum page block order will be given by
>>>> ARCH_FORCE_MAX_ORDER.
>>>>
>>>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>>>> value that ARCH_FORCE_MAX_ORDER. This will make sure that
>>>> current kernel configurations won't be affected by this
>>>> change. It is a opt-in change.
>>>>
>>>> This patch will allow to have the same CMA alignment
>>>> requirements for large page sizes (16KiB, 64KiB) as that
>>>> in 4kb kernels by setting a lower pageblock_order.
>>>>
>>>> Tests:
>>>>
>>>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>>>> on 4k and 16k kernels.
>>>>
>>>> - Verified that Transparent Huge Pages work when pageblock_order
>>>> is 1, 7, 10 on 4k and 16k kernels.
>>>>
>>>> - Verified that dma-buf heaps allocations work when pageblock_order
>>>> is 1, 7, 10 on 4k and 16k kernels.
>>>>
>>>> Benchmarks:
>>>>
>>>> The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The
>>>> reason for the pageblock_order 7 is because this value makes the min
>>>> CMA alignment requirement the same as that in 4kb kernels (2MB).
>>>>
>>>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>>>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>>>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>>>> the # of instructions and page-faults on 16k kernels.
>>>> The benchmark was executed 10 times. The averages are below:
>>>>
>>>>              # instructions         |     #page-faults
>>>>       order 10     |  order 7       | order 10 | order 7
>>>> --------------------------------------------------------
>>>>    13,891,765,770	 | 11,425,777,314 |    220   |   217
>>>>    14,456,293,487	 | 12,660,819,302 |    224   |   219
>>>>    13,924,261,018	 | 13,243,970,736 |    217   |   221
>>>>    13,910,886,504	 | 13,845,519,630 |    217   |   221
>>>>    14,388,071,190	 | 13,498,583,098 |    223   |   224
>>>>    13,656,442,167	 | 12,915,831,681 |    216   |   218
>>>>    13,300,268,343	 | 12,930,484,776 |    222   |   218
>>>>    13,625,470,223	 | 14,234,092,777 |    219   |   218
>>>>    13,508,964,965	 | 13,432,689,094 |    225   |   219
>>>>    13,368,950,667	 | 13,683,587,37  |    219   |   225
>>>> -------------------------------------------------------------------
>>>>    13,803,137,433  | 13,131,974,268 |    220   |   220    Averages
>>>>
>>>> There were 4.85% #instructions when order was 7, in comparison
>>>> with order 10.
>>>>
>>>>        13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)
>>>>
>>>> The number of page faults in order 7 and 10 were the same.
>>>>
>>>> These results didn't show any significant regression when the
>>>> pageblock_order is set to 7 on 16kb kernels.
>>>>
>>>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>>>>    on the 16k kernels with pageblock_order 7 and 10.
>>>>
>>>> order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
>>>> -------------------------------------------------------------------
>>>>     15.8	 |  16.4    |         0.6        |     3.80%
>>>>     16.4	 |  16.2    |        -0.2        |    -1.22%
>>>>     16.6	 |  16.3    |        -0.3        |    -1.81%
>>>>     16.8	 |  16.3    |        -0.5        |    -2.98%
>>>>     16.6	 |  16.8    |         0.2        |     1.20%
>>>> -------------------------------------------------------------------
>>>>     16.44     16.4            -0.04	          -0.24%   Averages
>>>>
>>>> The results didn't show any significant regression when the
>>>> pageblock_order is set to 7 on 16kb kernels.
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> CC: Mike Rapoport <rppt@kernel.org>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>> Signed-off-by: Juan Yescas <jyescas@google.com>
>>>> Acked-by: Zi Yan <ziy@nvidia.com>
>>>> ---
>>>> Changes in v7:
>>>>     - Update alignment calculation to 2MiB as per David's
>>>>       observation.
>>>>     - Update page block order calculation in mm/mm_init.c for
>>>>       powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
>>>>
>>>> Changes in v6:
>>>>     - Applied the change provided by Zi Yan to fix
>>>>       the Kconfig. The change consists in evaluating
>>>>       to true or false in the if expression for range:
>>>>       range 1 <symbol> if <expression to eval true/false>.
>>>>
>>>> Changes in v5:
>>>>     - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
>>>>       ranges with config definitions don't work in Kconfig,
>>>>       for example (range 1 MY_CONFIG).
>>>>     - Add PAGE_BLOCK_ORDER_MANUAL config for the
>>>>       page block order number. The default value was not
>>>>       defined.
>>>>     - Fix typos reported by Andrew.
>>>>     - Test default configs in powerpc.
>>>>
>>>> Changes in v4:
>>>>     - Set PAGE_BLOCK_ORDER in incluxe/linux/mmzone.h to
>>>>       validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
>>>>       compile time.
>>>>     - This change fixes the warning in:
>>>>     https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
>>>>
>>>> Changes in v3:
>>>>     - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>>>>       as per Matthew's suggestion.
>>>>     - Update comments in pageblock-flags.h for pageblock_order
>>>>       value when THP or HugeTLB are not used.
>>>>
>>>> Changes in v2:
>>>>     - Add Zi's Acked-by tag.
>>>>     - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>>>>       per Zi and Matthew suggestion so it is available to
>>>>       all the architectures.
>>>>     - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>>>>       ARCH_FORCE_MAX_ORDER is not available.
>>>>
>>>>    include/linux/mmzone.h          | 16 ++++++++++++++++
>>>>    include/linux/pageblock-flags.h |  8 ++++----
>>>>    mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
>>>>    mm/mm_init.c                    |  2 +-
>>>>    4 files changed, 55 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 6ccec1bf2896..05610337bbb6 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -37,6 +37,22 @@
>>>>     #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
>>>>   +/* Defines the order for the number of pages that have a migrate type. */
>>>> +#ifndef CONFIG_PAGE_BLOCK_ORDER
>>>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>>>> +#else
>>>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>>>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>>>> +
>>>> +/*
>>>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
>>>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
>>>> + * which defines the order for the number of pages that can have a migrate type
>>>> + */
>>>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
>>>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>>>>     * costly to service.  That is between allocation orders which should
>>>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>>>> index fc6b9c87cb0a..e73a4292ef02 100644
>>>> --- a/include/linux/pageblock-flags.h
>>>> +++ b/include/linux/pageblock-flags.h
>>>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>>>>     * Huge pages are a constant size, but don't exceed the maximum allocation
>>>>     * granularity.
>>>>     */
>>>> -#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>>>> +#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>>>>     #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>>>     #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>>>   -#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>>>> +#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>>>     #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>   -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>>>> -#define pageblock_order		MAX_PAGE_ORDER
>>>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>>>> +#define pageblock_order		PAGE_BLOCK_ORDER
>>>>     #endif /* CONFIG_HUGETLB_PAGE */
>>>>   diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index e113f713b493..13a5c4f6e6b6 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -989,6 +989,40 @@ config CMA_AREAS
>>>>     	  If unsure, leave the default value "8" in UMA and "20" in NUMA.
>>>>   +#
>>>> +# Select this config option from the architecture Kconfig, if available, to set
>>>> +# the max page order for physically contiguous allocations.
>>>> +#
>>>> +config ARCH_FORCE_MAX_ORDER
>>>> +	int
>>>> +
>>>> +#
>>>> +# When ARCH_FORCE_MAX_ORDER is not defined,
>>>> +# the default page block order is MAX_PAGE_ORDER (10) as per
>>>> +# include/linux/mmzone.h.
>>>> +#
>>>> +config PAGE_BLOCK_ORDER
>>>> +	int "Page Block Order"
>>>> +	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
>>>> +	default 10 if ARCH_FORCE_MAX_ORDER = 0
>>>> +	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>>> +	default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>>>> +	help
>>>> +	  The page block order refers to the power of two number of pages that
>>>> +	  are physically contiguous and can have a migrate type associated to
>>>> +	  them. The maximum size of the page block order is limited by
>>>> +	  ARCH_FORCE_MAX_ORDER.
>>>> +
>>>> +	  This config allows overriding the default page block order when the
>>>> +	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
>>>> +	  or MAX_PAGE_ORDER.
>>>> +
>>>> +	  Reducing pageblock order can negatively impact THP generation
>>>> +	  success rate. If your workloads uses THP heavily, please use this
>>>> +	  option with caution.
>>>> +
>>>> +	  Don't change if unsure.
>>>
>>>
>>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>>
>>> CONFIG_PAGE_BLOCK_ORDER=10
>>>
>>>
>>> But then, we'll do this
>>>
>>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>>
>>>
>>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>>
>>> Confusing.
>>>
>>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>>
>> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
>> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.
> 
> LIMIT might be still ambiguous, since it can be lower limit or upper limit.
> CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I come up with,
> if it looks good to you, I can send it out properly.

LGTM

-- 
Cheers,

David / dhildenb




end of thread, other threads:[~2025-06-03 15:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
2025-05-28  8:21 ` Vlastimil Babka
2025-05-28 18:24   ` Andrew Morton
2025-06-03 13:03 ` David Hildenbrand
2025-06-03 14:55   ` Zi Yan
2025-06-03 15:14     ` Zi Yan
2025-06-03 15:42       ` David Hildenbrand
2025-06-03 15:20   ` Juan Yescas
