* [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
@ 2025-05-21 21:57 Juan Yescas
2025-05-28 8:21 ` Vlastimil Babka
2025-06-03 13:03 ` David Hildenbrand
0 siblings, 2 replies; 8+ messages in thread
From: Juan Yescas @ 2025-05-21 21:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Juan Yescas, Zi Yan, linux-mm,
linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim
Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.
This means the system will have fewer MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks available, since MIGRATE_CMA can't fall
back to them.
CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER,
which in turn depends on ARCH_FORCE_MAX_ORDER. The value of
ARCH_FORCE_MAX_ORDER increases on 16KiB and 64KiB kernels.
For example, on ARM, the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
    4KiB  |             10 |               9 |  4KiB * (2 ^ 9)  =   2MiB
   16KiB  |             11 |              11 | 16KiB * (2 ^ 11) =  32MiB
   64KiB  |             13 |              13 | 64KiB * (2 ^ 13) = 512MiB
There are some extreme cases for the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
    4KiB  |             15 |              15 |  4KiB * (2 ^ 15) = 128MiB
   16KiB  |             13 |              13 | 16KiB * (2 ^ 13) = 128MiB
   64KiB  |             13 |              13 | 64KiB * (2 ^ 13) = 512MiB
This affects the CMA reservations for the drivers. If a driver needs
4MiB of CMA memory on a 4KiB kernel, the minimal reservation on a 16KiB
kernel has to be 32MiB due to the alignment requirements:
reserved-memory {
	...
	cma_test_reserve: cma_test_reserve {
		compatible = "shared-dma-pool";
		size = <0x0 0x400000>; /* 4 MiB */
		...
	};
};

reserved-memory {
	...
	cma_test_reserve: cma_test_reserve {
		compatible = "shared-dma-pool";
		size = <0x0 0x2000000>; /* 32 MiB */
		...
	};
};
Solution: Add a new config, CONFIG_PAGE_BLOCK_ORDER, that allows
setting the page block order on all architectures. The maximum page
block order is given by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER has the same value as
ARCH_FORCE_MAX_ORDER, which makes sure that current kernel
configurations won't be affected by this change. It is an opt-in
change.

This patch allows 16KiB and 64KiB kernels to have the same CMA
alignment requirements as 4KiB kernels by setting a lower
pageblock_order.
Tests:
- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
on 4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order
is 1, 7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order
is 1, 7, 10 on 4k and 16k kernels.
Benchmarks:
The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
pageblock_order 7 was chosen because it makes the minimum CMA alignment
requirement the same as on 4KiB kernels (2MiB).
- Perform 100K dma-buf heap (/dev/dma_heap/system) allocations of
  SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
  (https://developer.android.com/ndk/guides/simpleperf) to measure
  the number of instructions and page-faults on 16KiB kernels.
  The benchmark was executed 10 times. The averages are below:
# instructions | #page-faults
order 10 | order 7 | order 10 | order 7
--------------------------------------------------------
13,891,765,770 | 11,425,777,314 | 220 | 217
14,456,293,487 | 12,660,819,302 | 224 | 219
13,924,261,018 | 13,243,970,736 | 217 | 221
13,910,886,504 | 13,845,519,630 | 217 | 221
14,388,071,190 | 13,498,583,098 | 223 | 224
13,656,442,167 | 12,915,831,681 | 216 | 218
13,300,268,343 | 12,930,484,776 | 222 | 218
13,625,470,223 | 14,234,092,777 | 219 | 218
13,508,964,965 | 13,432,689,094 | 225 | 219
13,368,950,667 | 13,683,587,37 | 219 | 225
-------------------------------------------------------------------
13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
There were 4.86% fewer instructions when the order was 7, in
comparison with order 10:
13,803,137,433 - 13,131,974,268 = 671,163,165 (-4.86%)
The average number of page faults with order 7 and order 10 was the
same (220).
These results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.
- Run Speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5
  times on the 16KiB kernels with pageblock_order 7 and 10.
order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
15.8 | 16.4 | 0.6 | 3.80%
16.4 | 16.2 | -0.2 | -1.22%
16.6 | 16.3 | -0.3 | -1.81%
16.8 | 16.3 | -0.5 | -2.98%
16.6 | 16.8 | 0.2 | 1.20%
-------------------------------------------------------------------
16.44 | 16.4 | -0.04 | -0.24% Averages
The results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---
Changes in v7:
- Update alignment calculation to 2MiB as per David's
observation.
- Update page block order calculation in mm/mm_init.c for
powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
Changes in v6:
- Applied the change provided by Zi Yan to fix
  the Kconfig. The change consists of making the
  if expression for range evaluate to true or false:
  range 1 <symbol> if <expression that evaluates to true/false>.
Changes in v5:
- Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
ranges with config definitions don't work in Kconfig,
for example (range 1 MY_CONFIG).
- Add PAGE_BLOCK_ORDER_MANUAL config for the
page block order number. The default value was not
defined.
- Fix typos reported by Andrew.
- Test default configs in powerpc.
Changes in v4:
- Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
compile time.
- This change fixes the warning in:
https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
Changes in v3:
- Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
as per Matthew's suggestion.
- Update comments in pageblock-flags.h for pageblock_order
value when THP or HugeTLB are not used.
Changes in v2:
- Add Zi's Acked-by tag.
- Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
per Zi's and Matthew's suggestion so it is available to
all the architectures.
- Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
ARCH_FORCE_MAX_ORDER is not available.
include/linux/mmzone.h | 16 ++++++++++++++++
include/linux/pageblock-flags.h | 8 ++++----
mm/Kconfig | 34 +++++++++++++++++++++++++++++++++
mm/mm_init.c | 2 +-
4 files changed, 55 insertions(+), 5 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6ccec1bf2896..05610337bbb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,6 +37,22 @@
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
+/* Defines the order for the number of pages that have a migrate type. */
+#ifndef CONFIG_PAGE_BLOCK_ORDER
+#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#else
+#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
+#endif /* CONFIG_PAGE_BLOCK_ORDER */
+
+/*
+ * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
+ * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
+ * which defines the order for the number of pages that can have a migrate type
+ */
+#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#endif
+
/*
* PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
* costly to service. That is between allocation orders which should
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fc6b9c87cb0a..e73a4292ef02 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
* Huge pages are a constant size, but don't exceed the maximum allocation
* granularity.
*/
-#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
-#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order MAX_PAGE_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
+#define pageblock_order PAGE_BLOCK_ORDER
#endif /* CONFIG_HUGETLB_PAGE */
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..13a5c4f6e6b6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -989,6 +989,40 @@ config CMA_AREAS
If unsure, leave the default value "8" in UMA and "20" in NUMA.
+#
+# Select this config option from the architecture Kconfig, if available, to set
+# the max page order for physically contiguous allocations.
+#
+config ARCH_FORCE_MAX_ORDER
+ int
+
+#
+# When ARCH_FORCE_MAX_ORDER is not defined,
+# the default page block order is MAX_PAGE_ORDER (10) as per
+# include/linux/mmzone.h.
+#
+config PAGE_BLOCK_ORDER
+ int "Page Block Order"
+ range 1 10 if ARCH_FORCE_MAX_ORDER = 0
+ default 10 if ARCH_FORCE_MAX_ORDER = 0
+ range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+ default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+ help
+ The page block order refers to the power of two number of pages that
+ are physically contiguous and can have a migrate type associated to
+ them. The maximum size of the page block order is limited by
+ ARCH_FORCE_MAX_ORDER.
+
+ This config allows overriding the default page block order when the
+ page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
+ or MAX_PAGE_ORDER.
+
+ Reducing pageblock order can negatively impact THP generation
+ success rate. If your workloads use THP heavily, please use this
+ option with caution.
+
+ Don't change if unsure.
+
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 327764ca0ee4..ada5374764e4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1511,7 +1511,7 @@ static inline void setup_usemap(struct zone *zone) {}
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
{
- unsigned int order = MAX_PAGE_ORDER;
+ unsigned int order = PAGE_BLOCK_ORDER;
/* Check that pageblock_nr_pages has not already been setup */
if (pageblock_order)
--
2.49.0.1143.g0be31eac6b-goog
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
@ 2025-05-28 8:21 ` Vlastimil Babka
2025-05-28 18:24 ` Andrew Morton
2025-06-03 13:03 ` David Hildenbrand
1 sibling, 1 reply; 8+ messages in thread
From: Vlastimil Babka @ 2025-05-28 8:21 UTC (permalink / raw)
To: Juan Yescas, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, linux-mm, linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim
On 5/21/25 23:57, Juan Yescas wrote:
> [... full patch quoted ...]
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-05-28 8:21 ` Vlastimil Babka
@ 2025-05-28 18:24 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2025-05-28 18:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Juan Yescas, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
masahiroy, Minchan Kim
On Wed, 28 May 2025 10:21:46 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Great, thanks. I'll move this patch into mm-stable. I'll be sending
two MM pull requests to Linus this cycle. The below patches will be in
the second batch, next week.
#
m68k-remove-use-of-page-index.patch
mm-rename-page-index-to-page-__folio_index.patch
#
ntfs3-use-folios-more-in-ntfs_compress_write.patch
iov-remove-copy_page_from_iter_atomic.patch
#
zram-rename-zcomp_param_no_level.patch
zram-support-deflate-specific-params.patch
#
selftests-mm-deduplicate-test-logging-in-test_mlock_lock.patch
#
selftests-mm-deduplicate-default-page-size-test-results-in-thuge-gen.patch
#
memcg-disable-kmem-charging-in-nmi-for-unsupported-arch.patch
memcg-nmi-safe-memcg-stats-for-specific-archs.patch
memcg-add-nmi-safe-update-for-memcg_kmem.patch
memcg-nmi-safe-slab-stats-updates.patch
memcg-make-memcg_rstat_updated-nmi-safe.patch
#
mm-damon-core-avoid-destroyed-target-reference-from-damos-quota.patch
#
mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
#
selftests-mm-skip-guard_regionsuffd-tests-when-uffd-is-not-present.patch
selftests-mm-skip-hugevm-test-if-kernel-config-file-is-not-present.patch
#
hugetlb-show-nr_huge_pages-in-report_hugepages.patch
#
#
mm-damon-kconfig-set-damon_vaddrpaddrsysfs-default-to-damon.patch
mm-damon-kconfig-enable-config_damon-by-default.patch
#
mmu_gather-move-tlb-flush-for-vm_pfnmap-vm_mixedmap-vmas-into-free_pgtables.patch
#
mm-rust-make-config_mmu-ifdefs-more-narrow.patch
#
kcov-rust-add-flags-for-kcov-with-rust.patch
#
#
selftests-mm-deduplicate-test-names-in-madv_populate.patch
#
mmu_notifiers-remove-leftover-stub-macros.patch
#
mm-add-config_page_block_order-to-select-page-block-order.patch
#
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
2025-05-28 8:21 ` Vlastimil Babka
@ 2025-06-03 13:03 ` David Hildenbrand
2025-06-03 14:55 ` Zi Yan
2025-06-03 15:20 ` Juan Yescas
1 sibling, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-03 13:03 UTC (permalink / raw)
To: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, linux-mm, linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim
On 21.05.25 23:57, Juan Yescas wrote:
> [... patch description and earlier hunks quoted ...]
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e113f713b493..13a5c4f6e6b6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -989,6 +989,40 @@ config CMA_AREAS
>
> If unsure, leave the default value "8" in UMA and "20" in NUMA.
>
> +#
> +# Select this config option from the architecture Kconfig, if available, to set
> +# the max page order for physically contiguous allocations.
> +#
> +config ARCH_FORCE_MAX_ORDER
> + int
> +
> +#
> +# When ARCH_FORCE_MAX_ORDER is not defined,
> +# the default page block order is MAX_PAGE_ORDER (10) as per
> +# include/linux/mmzone.h.
> +#
> +config PAGE_BLOCK_ORDER
> + int "Page Block Order"
> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0
> + default 10 if ARCH_FORCE_MAX_ORDER = 0
> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> + help
> + The page block order refers to the power of two number of pages that
> + are physically contiguous and can have a migrate type associated to
> + them. The maximum size of the page block order is limited by
> + ARCH_FORCE_MAX_ORDER.
> +
> + This config allows overriding the default page block order when the
> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
> + or MAX_PAGE_ORDER.
> +
> + Reducing pageblock order can negatively impact THP generation
> + success rate. If your workloads use THP heavily, please use this
> + option with caution.
> +
> + Don't change if unsure.
The semantics are now very confusing [1]. The default in x86-64 will be
10, so we'll have
CONFIG_PAGE_BLOCK_ORDER=10
But then, we'll do this
#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER,
PAGE_BLOCK_ORDER)
So the actual pageblock order will be different than
CONFIG_PAGE_BLOCK_ORDER.
Confusing.
Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL
? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
[1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928
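A quick numeric sketch of that mismatch (values are assumptions for a default x86-64 build: 4KiB pages, so a 2MiB PMD gives HPAGE_PMD_ORDER = 9, and CONFIG_PAGE_BLOCK_ORDER defaults to 10):

```python
# Mirror of the selection in pageblock-flags.h for a THP-enabled build:
#   #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)

HPAGE_PMD_ORDER = 9           # 2MiB / 4KiB = 512 pages = 2^9 (assumed x86-64 default)
CONFIG_PAGE_BLOCK_ORDER = 10  # Kconfig default

pageblock_order = min(HPAGE_PMD_ORDER, CONFIG_PAGE_BLOCK_ORDER)

# The effective pageblock order (9) differs from the config value (10),
# which is exactly the confusion described above.
print(pageblock_order)                              # 9
print(pageblock_order != CONFIG_PAGE_BLOCK_ORDER)   # True
```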
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-06-03 13:03 ` David Hildenbrand
@ 2025-06-03 14:55 ` Zi Yan
2025-06-03 15:14 ` Zi Yan
2025-06-03 15:20 ` Juan Yescas
1 sibling, 1 reply; 8+ messages in thread
From: Zi Yan @ 2025-06-03 14:55 UTC (permalink / raw)
To: David Hildenbrand
Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
masahiroy, Minchan Kim
On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
> On 21.05.25 23:57, Juan Yescas wrote:
>> Problem: On large page size configurations (16KiB, 64KiB), the CMA
>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
>> and this causes the CMA reservations to be larger than necessary.
>> This means that system will have less available MIGRATE_UNMOVABLE and
>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
>>
>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
>>
>> For example, in ARM, the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
>> - CONFIG_TRANSPARENT_HUGEPAGE is set:
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>> -----------------------------------------------------------------------
>> 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
>> 16KiB | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>
>> There are some extreme cases for the CMA alignment requirement when:
>>
>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
>> - CONFIG_HUGETLB_PAGE is NOT set
>>
>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
>> ------------------------------------------------------------------------
>> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
>> 16KiB | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
>>
>> This affects the CMA reservations for the drivers. If a driver in a
>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
>> reservation has to be 32MiB due to the alignment requirements:
>>
>> reserved-memory {
>> ...
>> cma_test_reserve: cma_test_reserve {
>> compatible = "shared-dma-pool";
>> size = <0x0 0x400000>; /* 4 MiB */
>> ...
>> };
>> };
>>
>> reserved-memory {
>> ...
>> cma_test_reserve: cma_test_reserve {
>> compatible = "shared-dma-pool";
>> size = <0x0 0x2000000>; /* 32 MiB */
>> ...
>> };
>> };
>>
>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
>> allows setting the page block order in all architectures.
>> The maximum page block order is given by
>> ARCH_FORCE_MAX_ORDER.
>>
>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same
>> value as ARCH_FORCE_MAX_ORDER. This makes sure that
>> current kernel configurations won't be affected by this
>> change. It is an opt-in change.
>>
>> This patch allows large page size kernels (16KiB, 64KiB)
>> to have the same CMA alignment requirements as 4KiB
>> kernels by setting a lower pageblock_order.
>>
>> Tests:
>>
>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
>> on 4k and 16k kernels.
>>
>> - Verified that Transparent Huge Pages work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>>
>> - Verified that dma-buf heaps allocations work when pageblock_order
>> is 1, 7, 10 on 4k and 16k kernels.
>>
>> Benchmarks:
>>
>> The benchmarks compare 16KiB kernels with pageblock_order 10 and 7. The
>> reason for pageblock_order 7 is that this value makes the min
>> CMA alignment requirement the same as in 4KiB kernels (2MiB).
>>
>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
>> (https://developer.android.com/ndk/guides/simpleperf) to measure
>> the # of instructions and page-faults on 16k kernels.
>> The benchmark was executed 10 times. The averages are below:
>>
>> # instructions | #page-faults
>> order 10 | order 7 | order 10 | order 7
>> --------------------------------------------------------
>> 13,891,765,770 | 11,425,777,314 | 220 | 217
>> 14,456,293,487 | 12,660,819,302 | 224 | 219
>> 13,924,261,018 | 13,243,970,736 | 217 | 221
>> 13,910,886,504 | 13,845,519,630 | 217 | 221
>> 14,388,071,190 | 13,498,583,098 | 223 | 224
>> 13,656,442,167 | 12,915,831,681 | 216 | 218
>> 13,300,268,343 | 12,930,484,776 | 222 | 218
>> 13,625,470,223 | 14,234,092,777 | 219 | 218
>> 13,508,964,965 | 13,432,689,094 | 225 | 219
>> 13,368,950,667 | 13,683,587,37 | 219 | 225
>> -------------------------------------------------------------------
>> 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
>>
>> There were 4.86% fewer instructions when the order was 7, in
>> comparison with order 10.
>>
>> 13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
>>
>> The number of page faults in orders 7 and 10 was the same.
>>
>> These results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16KiB kernels.
>>
>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
>> on the 16k kernels with pageblock_order 7 and 10.
>>
>> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
>> -------------------------------------------------------------------
>> 15.8 | 16.4 | 0.6 | 3.80%
>> 16.4 | 16.2 | -0.2 | -1.22%
>> 16.6 | 16.3 | -0.3 | -1.81%
>> 16.8 | 16.3 | -0.5 | -2.98%
>> 16.6 | 16.8 | 0.2 | 1.20%
>> -------------------------------------------------------------------
>> 16.44 16.4 -0.04 -0.24% Averages
>>
>> The results didn't show any significant regression when the
>> pageblock_order is set to 7 on 16KiB kernels.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> CC: Mike Rapoport <rppt@kernel.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Signed-off-by: Juan Yescas <jyescas@google.com>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> ---
>> Changes in v7:
>> - Update alignment calculation to 2MiB as per David's
>> observation.
>> - Update page block order calculation in mm/mm_init.c for
>> powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
>>
>> Changes in v6:
>> - Applied the change provided by Zi Yan to fix
>> the Kconfig. The change consists in evaluating
>> to true or false in the if expression for range:
>> range 1 <symbol> if <expression to eval true/false>.
>>
>> Changes in v5:
>> - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
>> ranges with config definitions don't work in Kconfig,
>> for example (range 1 MY_CONFIG).
>> - Add PAGE_BLOCK_ORDER_MANUAL config for the
>> page block order number. The default value was not
>> defined.
>> - Fix typos reported by Andrew.
>> - Test default configs in powerpc.
>>
>> Changes in v4:
>> - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
>> validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
>> compile time.
>> - This change fixes the warning in:
>> https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
>>
>> Changes in v3:
>> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
>> as per Matthew's suggestion.
>> - Update comments in pageblock-flags.h for pageblock_order
>> value when THP or HugeTLB are not used.
>>
>> Changes in v2:
>> - Add Zi's Acked-by tag.
>> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
>> per Zi and Matthew suggestion so it is available to
>> all the architectures.
>> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
>> ARCH_FORCE_MAX_ORDER is not available.
>>
>> include/linux/mmzone.h | 16 ++++++++++++++++
>> include/linux/pageblock-flags.h | 8 ++++----
>> mm/Kconfig | 34 +++++++++++++++++++++++++++++++++
>> mm/mm_init.c | 2 +-
>> 4 files changed, 55 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 6ccec1bf2896..05610337bbb6 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -37,6 +37,22 @@
>> #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
>> +/* Defines the order for the number of pages that have a migrate type. */
>> +#ifndef CONFIG_PAGE_BLOCK_ORDER
>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
>> +#else
>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
>> +
>> +/*
>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
>> + * which defines the order for the number of pages that can have a migrate type
>> + */
>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
>> +#endif
>> +
>> /*
>> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>> * costly to service. That is between allocation orders which should
>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
>> index fc6b9c87cb0a..e73a4292ef02 100644
>> --- a/include/linux/pageblock-flags.h
>> +++ b/include/linux/pageblock-flags.h
>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>> * Huge pages are a constant size, but don't exceed the maximum allocation
>> * granularity.
>> */
>> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
>> -#define pageblock_order MAX_PAGE_ORDER
>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
>> +#define pageblock_order PAGE_BLOCK_ORDER
>> #endif /* CONFIG_HUGETLB_PAGE */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e113f713b493..13a5c4f6e6b6 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -989,6 +989,40 @@ config CMA_AREAS
>> If unsure, leave the default value "8" in UMA and "20" in NUMA.
>> +#
>> +# Select this config option from the architecture Kconfig, if available, to set
>> +# the max page order for physically contiguous allocations.
>> +#
>> +config ARCH_FORCE_MAX_ORDER
>> + int
>> +
>> +#
>> +# When ARCH_FORCE_MAX_ORDER is not defined,
>> +# the default page block order is MAX_PAGE_ORDER (10) as per
>> +# include/linux/mmzone.h.
>> +#
>> +config PAGE_BLOCK_ORDER
>> + int "Page Block Order"
>> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0
>> + default 10 if ARCH_FORCE_MAX_ORDER = 0
>> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
>> + help
>> + The page block order refers to the power of two number of pages that
>> + are physically contiguous and can have a migrate type associated to
>> + them. The maximum size of the page block order is limited by
>> + ARCH_FORCE_MAX_ORDER.
>> +
>> + This config allows overriding the default page block order when the
>> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
>> + or MAX_PAGE_ORDER.
>> +
>> + Reducing pageblock order can negatively impact THP generation
>> + success rate. If your workload uses THP heavily, please use this
>> + option with caution.
>> +
>> + Don't change if unsure.
>
>
> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>
> CONFIG_PAGE_BLOCK_ORDER=10
>
>
> But then, we'll do this
>
> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>
>
> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>
> Confusing.
>
> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.
>
> [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928
>
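As a rough sketch of why limiting the pageblock order shrinks the minimum CMA reservation (orders and sizes taken from the tables in the commit message):

```python
# CMA regions must be aligned to (and sized in multiples of)
# CMA_MIN_ALIGNMENT_BYTES = PAGE_SIZE << pageblock_order.

def cma_min_align(page_size, pageblock_order):
    return page_size << pageblock_order

def min_reservation(request, align):
    # Round a reservation request up to the alignment granularity.
    return -(-request // align) * align

KiB = 1 << 10
MiB = 1 << 20

# 16KiB kernel, default pageblock_order 11 -> 32MiB alignment,
# so a 4MiB request balloons to a 32MiB reservation.
align_default = cma_min_align(16 * KiB, 11)
print(align_default // MiB)                            # 32
print(min_reservation(4 * MiB, align_default) // MiB)  # 32

# 16KiB kernel with pageblock order limited to 7 -> 2MiB alignment,
# matching 4KiB kernels; the same request only needs 4MiB.
align_limited = cma_min_align(16 * KiB, 7)
print(align_limited // MiB)                            # 2
print(min_reservation(4 * MiB, align_limited) // MiB)  # 4
```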
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-06-03 14:55 ` Zi Yan
@ 2025-06-03 15:14 ` Zi Yan
2025-06-03 15:42 ` David Hildenbrand
0 siblings, 1 reply; 8+ messages in thread
From: Zi Yan @ 2025-06-03 15:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
masahiroy, Minchan Kim
On 3 Jun 2025, at 10:55, Zi Yan wrote:
> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>
>> On 21.05.25 23:57, Juan Yescas wrote:
>>> [snip — patch body identical to the quote in the previous message]
>>
>>
>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>
>> CONFIG_PAGE_BLOCK_ORDER=10
>>
>>
>> But then, we'll do this
>>
>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>
>>
>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>
>> Confusing.
>>
>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>
> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.
LIMIT might be still ambiguous, since it can be lower limit or upper limit.
CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I came up with;
if it looks good to you, I can send it out properly.
From 7fff4fd87ed3aa160db8d2f0d9e5b219321df4f9 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 3 Jun 2025 11:09:37 -0400
Subject: [PATCH] mm: rename CONFIG_PAGE_BLOCK_ORDER to
CONFIG_PAGE_BLOCK_ORDER_CEIL.
The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/mmzone.h | 14 +++++++-------
include/linux/pageblock-flags.h | 8 ++++----
mm/Kconfig | 15 ++++++++-------
3 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 283913d42d7b..523b407e63e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,19 +38,19 @@
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
/* Defines the order for the number of pages that have a migrate type. */
-#ifndef CONFIG_PAGE_BLOCK_ORDER
-#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#ifndef CONFIG_PAGE_BLOCK_ORDER_CEIL
+#define PAGE_BLOCK_ORDER_CEIL MAX_PAGE_ORDER
#else
-#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
-#endif /* CONFIG_PAGE_BLOCK_ORDER */
+#define PAGE_BLOCK_ORDER_CEIL CONFIG_PAGE_BLOCK_ORDER_CEIL
+#endif /* CONFIG_PAGE_BLOCK_ORDER_CEIL */
/*
* The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
- * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
+ * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER_CEIL,
* which defines the order for the number of pages that can have a migrate type
*/
-#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
-#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#if (PAGE_BLOCK_ORDER_CEIL > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER_CEIL
#endif
/*
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e73a4292ef02..e7a86cd238c2 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
* Huge pages are a constant size, but don't exceed the maximum allocation
* granularity.
*/
-#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER_CEIL)
#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
-#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
+#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL)
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
-#define pageblock_order PAGE_BLOCK_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER_CEIL */
+#define pageblock_order PAGE_BLOCK_ORDER_CEIL
#endif /* CONFIG_HUGETLB_PAGE */
diff --git a/mm/Kconfig b/mm/Kconfig
index eccb2e46ffcb..3b27e644bd1f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1017,8 +1017,8 @@ config ARCH_FORCE_MAX_ORDER
# the default page block order is MAX_PAGE_ORDER (10) as per
# include/linux/mmzone.h.
#
-config PAGE_BLOCK_ORDER
- int "Page Block Order"
+config PAGE_BLOCK_ORDER_CEIL
+ int "Page Block Order Upper Limit"
range 1 10 if ARCH_FORCE_MAX_ORDER = 0
default 10 if ARCH_FORCE_MAX_ORDER = 0
range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
@@ -1026,12 +1026,13 @@ config PAGE_BLOCK_ORDER
help
The page block order refers to the power of two number of pages that
are physically contiguous and can have a migrate type associated to
- them. The maximum size of the page block order is limited by
- ARCH_FORCE_MAX_ORDER.
+ them. The maximum size of the page block order is at least limited by
+ ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER.
- This config allows overriding the default page block order when the
- page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
- or MAX_PAGE_ORDER.
+ This config adds a new upper limit of default page block
+ order when the page block order is required to be smaller than
+ ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER or other limits
+ (see include/linux/pageblock-flags.h for details).
Reducing pageblock order can negatively impact THP generation
success rate. If your workload uses THP heavily, please use this
--
2.47.2
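To make the "CEIL" semantics concrete, here is a hedged sketch of how the renamed limit combines with the huge-page orders in pageblock-flags.h — the config only ever lowers the effective pageblock order, never raises it (the example orders are illustrative assumptions, not tied to a specific arch):

```python
# PAGE_BLOCK_ORDER_CEIL acts purely as an upper bound on pageblock_order
# in each of the compile-time cases in pageblock-flags.h.

def pageblock_order(ceil, hugetlb_order=None, hpage_pmd_order=None):
    if hugetlb_order is not None:      # CONFIG_HUGETLB_PAGE, fixed huge page size
        return min(hugetlb_order, ceil)
    if hpage_pmd_order is not None:    # CONFIG_TRANSPARENT_HUGEPAGE
        return min(hpage_pmd_order, ceil)
    return ceil                        # neither: group by the ceiling itself

# With a ceiling of 10, a smaller PMD order (9) still wins ...
print(pageblock_order(10, hpage_pmd_order=9))  # 9
# ... while a lower ceiling of 7 caps both huge-page cases.
print(pageblock_order(7, hpage_pmd_order=9))   # 7
print(pageblock_order(7, hugetlb_order=9))     # 7
```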
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-06-03 13:03 ` David Hildenbrand
2025-06-03 14:55 ` Zi Yan
@ 2025-06-03 15:20 ` Juan Yescas
1 sibling, 0 replies; 8+ messages in thread
From: Juan Yescas @ 2025-06-03 15:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
masahiroy, Minchan Kim
On Tue, Jun 3, 2025 at 6:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.05.25 23:57, Juan Yescas wrote:
> > Problem: On large page size configurations (16KiB, 64KiB), the CMA
> > alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
> > and this causes the CMA reservations to be larger than necessary.
> > This means that system will have less available MIGRATE_UNMOVABLE and
> > MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
> >
> > The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
> > MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of
> > ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
> >
> > For example, in ARM, the CMA alignment requirement when:
> >
> > - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
> > - CONFIG_TRANSPARENT_HUGEPAGE is set:
> >
> > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
> > -----------------------------------------------------------------------
> > 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
> > 16KiB | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
> > 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
> >
> > There are some extreme cases for the CMA alignment requirement when:
> >
> > - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
> > - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
> > - CONFIG_HUGETLB_PAGE is NOT set
> >
> > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
> > ------------------------------------------------------------------------
> > 4KiB      | 15             | 15              | 4KiB  * (2 ^ 15) = 128MiB
> > 16KiB     | 13             | 13              | 16KiB * (2 ^ 13) = 128MiB
> > 64KiB     | 13             | 13              | 64KiB * (2 ^ 13) = 512MiB
> >
> > This affects the CMA reservations for the drivers. If a driver in a
> > 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
> > reservation has to be 32MiB due to the alignment requirements:
> >
> > reserved-memory {
> > ...
> > cma_test_reserve: cma_test_reserve {
> > compatible = "shared-dma-pool";
> > size = <0x0 0x400000>; /* 4 MiB */
> > ...
> > };
> > };
> >
> > reserved-memory {
> > ...
> > cma_test_reserve: cma_test_reserve {
> > compatible = "shared-dma-pool";
> > size = <0x0 0x2000000>; /* 32 MiB */
> > ...
> > };
> > };
> >
> > Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that
> > allows setting the page block order on all architectures.
> > The maximum page block order is given by
> > ARCH_FORCE_MAX_ORDER.
> >
> > By default, CONFIG_PAGE_BLOCK_ORDER will have the same
> > value as ARCH_FORCE_MAX_ORDER. This makes sure that
> > current kernel configurations won't be affected by this
> > change. It is an opt-in change.
> >
> > This patch allows large page size kernels (16KiB, 64KiB)
> > to have the same CMA alignment requirements as 4KiB
> > kernels by setting a lower pageblock_order.
> >
> > Tests:
> >
> > - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
> > on 4k and 16k kernels.
> >
> > - Verified that Transparent Huge Pages work when pageblock_order
> > is 1, 7, 10 on 4k and 16k kernels.
> >
> > - Verified that dma-buf heaps allocations work when pageblock_order
> > is 1, 7, 10 on 4k and 16k kernels.
> >
> > Benchmarks:
> >
> > The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
> > Pageblock order 7 was chosen because it makes the minimum CMA
> > alignment requirement the same as in 4KiB kernels (2MiB).
> >
> > - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
> > SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
> > (https://developer.android.com/ndk/guides/simpleperf) to measure
> > the # of instructions and page-faults on 16k kernels.
> > The benchmark was executed 10 times. The averages are below:
> >
> > # instructions | #page-faults
> > order 10 | order 7 | order 10 | order 7
> > --------------------------------------------------------
> > 13,891,765,770 | 11,425,777,314 | 220 | 217
> > 14,456,293,487 | 12,660,819,302 | 224 | 219
> > 13,924,261,018 | 13,243,970,736 | 217 | 221
> > 13,910,886,504 | 13,845,519,630 | 217 | 221
> > 14,388,071,190 | 13,498,583,098 | 223 | 224
> > 13,656,442,167 | 12,915,831,681 | 216 | 218
> > 13,300,268,343 | 12,930,484,776 | 222 | 218
> > 13,625,470,223 | 14,234,092,777 | 219 | 218
> > 13,508,964,965 | 13,432,689,094 | 225 | 219
> > 13,368,950,667 | 13,683,587,37 | 219 | 225
> > -------------------------------------------------------------------
> > 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
> >
> > There were 4.86% fewer instructions when the order was 7, in
> > comparison with order 10:
> >
> > 13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)
> >
> > The number of page faults with orders 7 and 10 was the same.
> >
> > These results didn't show any significant regression when
> > pageblock_order was set to 7 on 16KiB kernels.
> >
> > - Run Speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
> > on the 16KiB kernels with pageblock_order 7 and 10.
> >
> > order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
> > -------------------------------------------------------------------
> > 15.8 | 16.4 | 0.6 | 3.80%
> > 16.4 | 16.2 | -0.2 | -1.22%
> > 16.6 | 16.3 | -0.3 | -1.81%
> > 16.8 | 16.3 | -0.5 | -2.98%
> > 16.6 | 16.8 | 0.2 | 1.20%
> > -------------------------------------------------------------------
> > 16.44    | 16.4    | -0.04              | -0.24%  Averages
> >
> > The results didn't show any significant regression when
> > pageblock_order was set to 7 on 16KiB kernels.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > CC: Mike Rapoport <rppt@kernel.org>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Signed-off-by: Juan Yescas <jyescas@google.com>
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > ---
> > Changes in v7:
> > - Update alignment calculation to 2MiB as per David's
> > observation.
> > - Update page block order calculation in mm/mm_init.c for
> > powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
> >
> > Changes in v6:
> > - Applied the change provided by Zi Yan to fix
> > the Kconfig. The change consists in evaluating
> > to true or false in the if expression for range:
> > range 1 <symbol> if <expression to eval true/false>.
> >
> > Changes in v5:
> > - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
> > ranges with config definitions don't work in Kconfig,
> > for example (range 1 MY_CONFIG).
> > - Add PAGE_BLOCK_ORDER_MANUAL config for the
> > page block order number. The default value was not
> > defined.
> > - Fix typos reported by Andrew.
> > - Test default configs in powerpc.
> >
> > Changes in v4:
> > - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
> > validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
> > compile time.
> > - This change fixes the warning in:
> > https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
> >
> > Changes in v3:
> > - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
> > as per Matthew's suggestion.
> > - Update comments in pageblock-flags.h for pageblock_order
> > value when THP or HugeTLB are not used.
> >
> > Changes in v2:
> > - Add Zi's Acked-by tag.
> > - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
> > per Zi and Matthew suggestion so it is available to
> > all the architectures.
> > - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
> > ARCH_FORCE_MAX_ORDER is not available.
> >
> > include/linux/mmzone.h | 16 ++++++++++++++++
> > include/linux/pageblock-flags.h | 8 ++++----
> > mm/Kconfig | 34 +++++++++++++++++++++++++++++++++
> > mm/mm_init.c | 2 +-
> > 4 files changed, 55 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 6ccec1bf2896..05610337bbb6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -37,6 +37,22 @@
> >
> > #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
> >
> > +/* Defines the order for the number of pages that have a migrate type. */
> > +#ifndef CONFIG_PAGE_BLOCK_ORDER
> > +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
> > +#else
> > +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
> > +#endif /* CONFIG_PAGE_BLOCK_ORDER */
> > +
> > +/*
> > + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
> > + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
> > + * which defines the order for the number of pages that can have a migrate type
> > + */
> > +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
> > +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
> > +#endif
> > +
> > /*
> > * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> > * costly to service. That is between allocation orders which should
> > diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> > index fc6b9c87cb0a..e73a4292ef02 100644
> > --- a/include/linux/pageblock-flags.h
> > +++ b/include/linux/pageblock-flags.h
> > @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
> > * Huge pages are a constant size, but don't exceed the maximum allocation
> > * granularity.
> > */
> > -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
> >
> > #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
> >
> > #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >
> > -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> > -#define pageblock_order MAX_PAGE_ORDER
> > +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
> > +#define pageblock_order PAGE_BLOCK_ORDER
> >
> > #endif /* CONFIG_HUGETLB_PAGE */
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index e113f713b493..13a5c4f6e6b6 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -989,6 +989,40 @@ config CMA_AREAS
> >
> > If unsure, leave the default value "8" in UMA and "20" in NUMA.
> >
> > +#
> > +# Select this config option from the architecture Kconfig, if available, to set
> > +# the max page order for physically contiguous allocations.
> > +#
> > +config ARCH_FORCE_MAX_ORDER
> > + int
> > +
> > +#
> > +# When ARCH_FORCE_MAX_ORDER is not defined,
> > +# the default page block order is MAX_PAGE_ORDER (10) as per
> > +# include/linux/mmzone.h.
> > +#
> > +config PAGE_BLOCK_ORDER
> > + int "Page Block Order"
> > + range 1 10 if ARCH_FORCE_MAX_ORDER = 0
> > + default 10 if ARCH_FORCE_MAX_ORDER = 0
> > + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> > + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
> > + help
> > + The page block order refers to the power of two number of pages that
> > + are physically contiguous and can have a migrate type associated to
> > + them. The maximum size of the page block order is limited by
> > + ARCH_FORCE_MAX_ORDER.
> > +
> > + This config allows overriding the default page block order when the
> > + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
> > + or MAX_PAGE_ORDER.
> > +
> > + Reducing pageblock order can negatively impact THP generation
> > + success rate. If your workloads uses THP heavily, please use this
> > + option with caution.
> > +
> > + Don't change if unsure.
>
>
> The semantics are now very confusing [1]. The default in x86-64 will be
> 10, so we'll have
>
> CONFIG_PAGE_BLOCK_ORDER=10
>
>
> But then, we'll do this
>
> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER,
> PAGE_BLOCK_ORDER)
>
>
> So the actual pageblock order will be different than
> CONFIG_PAGE_BLOCK_ORDER.
>
> Confusing.
I agree that it becomes confusing, because the pageblock_order
value depends on whether THP or HugeTLB is set.
>
> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL
> ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>
We could rename the configuration to CONFIG_PAGE_BLOCK_ORDER_CEIL.
> [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
2025-06-03 15:14 ` Zi Yan
@ 2025-06-03 15:42 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-03 15:42 UTC (permalink / raw)
To: Zi Yan
Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh,
masahiroy, Minchan Kim
On 03.06.25 17:14, Zi Yan wrote:
> On 3 Jun 2025, at 10:55, Zi Yan wrote:
>
>> On 3 Jun 2025, at 9:03, David Hildenbrand wrote:
>>
>>> On 21.05.25 23:57, Juan Yescas wrote:
>>>> [...]
>>>
>>>
>>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have
>>>
>>> CONFIG_PAGE_BLOCK_ORDER=10
>>>
>>>
>>> But then, we'll do this
>>>
>>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>>
>>>
>>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER.
>>>
>>> Confusing.
>>>
>>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.
>>
>> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region
>> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me.
>
> LIMIT might be still ambiguous, since it can be lower limit or upper limit.
> CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I came up with;
> if it looks good to you, I can send it out properly.
LGTM
--
Cheers,
David / dhildenb
Thread overview: 8+ messages
2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
2025-05-28 8:21 ` Vlastimil Babka
2025-05-28 18:24 ` Andrew Morton
2025-06-03 13:03 ` David Hildenbrand
2025-06-03 14:55 ` Zi Yan
2025-06-03 15:14 ` Zi Yan
2025-06-03 15:42 ` David Hildenbrand
2025-06-03 15:20 ` Juan Yescas