* [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
@ 2025-05-21 21:57 Juan Yescas
2025-05-28 8:21 ` Vlastimil Babka
2025-06-03 13:03 ` David Hildenbrand
0 siblings, 2 replies; 8+ messages in thread
From: Juan Yescas @ 2025-05-21 21:57 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Juan Yescas, Zi Yan, linux-mm,
linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim
Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.
This means that the system will have fewer available MIGRATE_UNMOVABLE
and MIGRATE_RECLAIMABLE page blocks, since MIGRATE_CMA can't fall back
to them.

CMA_MIN_ALIGNMENT_BYTES increases because it depends on
MAX_PAGE_ORDER, which in turn depends on ARCH_FORCE_MAX_ORDER. The
value of ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels.
For example, on ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
16KiB | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
16KiB | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
This affects the CMA reservations for the drivers. If a driver in a
4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
reservation has to be 32MiB due to the alignment requirements:
reserved-memory {
	...
	cma_test_reserve: cma_test_reserve {
		compatible = "shared-dma-pool";
		size = <0x0 0x400000>; /* 4 MiB */
		...
	};
};

reserved-memory {
	...
	cma_test_reserve: cma_test_reserve {
		compatible = "shared-dma-pool";
		size = <0x0 0x2000000>; /* 32 MiB */
		...
	};
};
Solution: Add a new config, CONFIG_PAGE_BLOCK_ORDER, that
allows setting the page block order on all the architectures.
The maximum page block order will be given by
ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER will have the same
value as ARCH_FORCE_MAX_ORDER. This makes sure that
current kernel configurations won't be affected by this
change. It is an opt-in change.

This patch will allow the CMA alignment requirements for
large page sizes (16KiB, 64KiB) to match those of 4KiB
kernels by setting a lower pageblock_order.
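For instance, a 16KiB kernel that wants the 4KiB kernels' 2MiB CMA
alignment could set (a hypothetical .config fragment, assuming this
patch is applied; 16KiB * 2^7 = 2MiB):

```
CONFIG_PAGE_BLOCK_ORDER=7
```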
Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
  on 4k and 16k kernels.

- Verified that Transparent Huge Pages work when pageblock_order
  is 1, 7, 10 on 4k and 16k kernels.

- Verified that dma-buf heaps allocations work when pageblock_order
  is 1, 7, 10 on 4k and 16k kernels.
Benchmarks:

The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
The reason for pageblock_order 7 is that this value makes the min
CMA alignment requirement the same as in 4KiB kernels (2MiB).
- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
(https://developer.android.com/ndk/guides/simpleperf) to measure
the # of instructions and page-faults on 16k kernels.
The benchmark was executed 10 times. The averages are below:
# instructions | #page-faults
order 10 | order 7 | order 10 | order 7
--------------------------------------------------------
13,891,765,770 | 11,425,777,314 | 220 | 217
14,456,293,487 | 12,660,819,302 | 224 | 219
13,924,261,018 | 13,243,970,736 | 217 | 221
13,910,886,504 | 13,845,519,630 | 217 | 221
14,388,071,190 | 13,498,583,098 | 223 | 224
13,656,442,167 | 12,915,831,681 | 216 | 218
13,300,268,343 | 12,930,484,776 | 222 | 218
13,625,470,223 | 14,234,092,777 | 219 | 218
13,508,964,965 | 13,432,689,094 | 225 | 219
13,368,950,667 | 13,683,587,37 | 219 | 225
-------------------------------------------------------------------
13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
There were 4.86% fewer instructions when the order was 7, in
comparison with order 10:

13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)

The number of page faults with order 7 and order 10 was the same.
These results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.
- Run Speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
  on the 16k kernels with pageblock_order 7 and 10.
order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
15.8 | 16.4 | 0.6 | 3.80%
16.4 | 16.2 | -0.2 | -1.22%
16.6 | 16.3 | -0.3 | -1.81%
16.8 | 16.3 | -0.5 | -2.98%
16.6 | 16.8 | 0.2 | 1.20%
-------------------------------------------------------------------
16.44 16.4 -0.04 -0.24% Averages
The results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---
Changes in v7:
- Update alignment calculation to 2MiB as per David's
observation.
- Update page block order calculation in mm/mm_init.c for
powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set.
Changes in v6:
- Applied the change provided by Zi Yan to fix
  the Kconfig. The change consists of evaluating
  to true or false in the if expression for range:
  range 1 <symbol> if <expression to eval true/false>.
Changes in v5:
- Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The
ranges with config definitions don't work in Kconfig,
for example (range 1 MY_CONFIG).
- Add PAGE_BLOCK_ORDER_MANUAL config for the
page block order number. The default value was not
defined.
- Fix typos reported by Andrew.
- Test default configs in powerpc.
Changes in v4:
- Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
  validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
  compile time.
- This change fixes the warning in:
https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/
Changes in v3:
- Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
as per Matthew's suggestion.
- Update comments in pageblock-flags.h for pageblock_order
value when THP or HugeTLB are not used.
Changes in v2:
- Add Zi's Acked-by tag.
- Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
  per Zi's and Matthew's suggestion so it is available to
  all the architectures.
- Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
ARCH_FORCE_MAX_ORDER is not available.
include/linux/mmzone.h | 16 ++++++++++++++++
include/linux/pageblock-flags.h | 8 ++++----
mm/Kconfig | 34 +++++++++++++++++++++++++++++++++
mm/mm_init.c | 2 +-
4 files changed, 55 insertions(+), 5 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6ccec1bf2896..05610337bbb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,6 +37,22 @@
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
+/* Defines the order for the number of pages that have a migrate type. */
+#ifndef CONFIG_PAGE_BLOCK_ORDER
+#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#else
+#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
+#endif /* CONFIG_PAGE_BLOCK_ORDER */
+
+/*
+ * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
+ * by the buddy allocator, has to be larger than or equal to PAGE_BLOCK_ORDER,
+ * which defines the order for the number of pages that can have a migrate type.
+ */
+#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#endif
+
/*
* PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
* costly to service. That is between allocation orders which should
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fc6b9c87cb0a..e73a4292ef02 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
* Huge pages are a constant size, but don't exceed the maximum allocation
* granularity.
*/
-#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
-#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order MAX_PAGE_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
+#define pageblock_order PAGE_BLOCK_ORDER
#endif /* CONFIG_HUGETLB_PAGE */
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..13a5c4f6e6b6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -989,6 +989,40 @@ config CMA_AREAS
If unsure, leave the default value "8" in UMA and "20" in NUMA.
+#
+# Select this config option from the architecture Kconfig, if available, to set
+# the max page order for physically contiguous allocations.
+#
+config ARCH_FORCE_MAX_ORDER
+ int
+
+#
+# When ARCH_FORCE_MAX_ORDER is not defined,
+# the default page block order is MAX_PAGE_ORDER (10) as per
+# include/linux/mmzone.h.
+#
+config PAGE_BLOCK_ORDER
+ int "Page Block Order"
+ range 1 10 if ARCH_FORCE_MAX_ORDER = 0
+ default 10 if ARCH_FORCE_MAX_ORDER = 0
+ range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+ default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+ help
+ The page block order refers to the power of two number of pages that
+ are physically contiguous and can have a migrate type associated to
+ them. The maximum size of the page block order is limited by
+ ARCH_FORCE_MAX_ORDER.
+
+ This config allows overriding the default page block order when the
+ page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
+ or MAX_PAGE_ORDER.
+
+	  Reducing pageblock order can negatively impact THP generation
+	  success rate. If your workload uses THP heavily, please use this
+	  option with caution.
+
+ Don't change if unsure.
+
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 327764ca0ee4..ada5374764e4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1511,7 +1511,7 @@ static inline void setup_usemap(struct zone *zone) {}
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
{
- unsigned int order = MAX_PAGE_ORDER;
+ unsigned int order = PAGE_BLOCK_ORDER;
/* Check that pageblock_nr_pages has not already been setup */
if (pageblock_order)
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
@ 2025-05-28  8:21 ` Vlastimil Babka
  2025-05-28 18:24   ` Andrew Morton
  2025-06-03 13:03 ` David Hildenbrand
  1 sibling, 1 reply; 8+ messages in thread
From: Vlastimil Babka @ 2025-05-28 8:21 UTC (permalink / raw)
To: Juan Yescas, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, linux-mm, linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 5/21/25 23:57, Juan Yescas wrote:
> Problem: On large page size configurations (16KiB, 64KiB), the CMA
> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
> and this causes the CMA reservations to be larger than necessary.
[...]
> Signed-off-by: Juan Yescas <jyescas@google.com>
> Acked-by: Zi Yan <ziy@nvidia.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-28  8:21 ` Vlastimil Babka
@ 2025-05-28 18:24   ` Andrew Morton
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2025-05-28 18:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Juan Yescas, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan, linux-mm,
	linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy,
	Minchan Kim

On Wed, 28 May 2025 10:21:46 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Great, thanks.  I'll move this patch into mm-stable.

I'll be sending two MM pull requests to Linus this cycle.  The below
patches will be in the second batch, next week.

# m68k-remove-use-of-page-index.patch
mm-rename-page-index-to-page-__folio_index.patch
# ntfs3-use-folios-more-in-ntfs_compress_write.patch
iov-remove-copy_page_from_iter_atomic.patch
# zram-rename-zcomp_param_no_level.patch
zram-support-deflate-specific-params.patch
# selftests-mm-deduplicate-test-logging-in-test_mlock_lock.patch
# selftests-mm-deduplicate-default-page-size-test-results-in-thuge-gen.patch
# memcg-disable-kmem-charging-in-nmi-for-unsupported-arch.patch
memcg-nmi-safe-memcg-stats-for-specific-archs.patch
memcg-add-nmi-safe-update-for-memcg_kmem.patch
memcg-nmi-safe-slab-stats-updates.patch
memcg-make-memcg_rstat_updated-nmi-safe.patch
# mm-damon-core-avoid-destroyed-target-reference-from-damos-quota.patch
# mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
# selftests-mm-skip-guard_regionsuffd-tests-when-uffd-is-not-present.patch
selftests-mm-skip-hugevm-test-if-kernel-config-file-is-not-present.patch
# hugetlb-show-nr_huge_pages-in-report_hugepages.patch
#
# mm-damon-kconfig-set-damon_vaddrpaddrsysfs-default-to-damon.patch
mm-damon-kconfig-enable-config_damon-by-default.patch
# mmu_gather-move-tlb-flush-for-vm_pfnmap-vm_mixedmap-vmas-into-free_pgtables.patch
# mm-rust-make-config_mmu-ifdefs-more-narrow.patch
# kcov-rust-add-flags-for-kcov-with-rust.patch
#
# selftests-mm-deduplicate-test-names-in-madv_populate.patch
# mmu_notifiers-remove-leftover-stub-macros.patch
# mm-add-config_page_block_order-to-select-page-block-order.patch
#

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
  2025-05-28  8:21 ` Vlastimil Babka
@ 2025-06-03 13:03 ` David Hildenbrand
  2025-06-03 14:55   ` Zi Yan
  2025-06-03 15:20   ` Juan Yescas
  1 sibling, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-03 13:03 UTC (permalink / raw)
To: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, linux-mm, linux-kernel
Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 21.05.25 23:57, Juan Yescas wrote:
> Problem: On large page size configurations (16KiB, 64KiB), the CMA
> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
> and this causes the CMA reservations to be larger than necessary.
[...]
> +	  This config allows overriding the default page block order when the
> +	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
> +	  or MAX_PAGE_ORDER.
> +
> +	  Reducing pageblock order can negatively impact THP generation
> +	  success rate. If your workloads uses THP heavily, please use this
> +	  option with caution.
> +
> +	  Don't change if unsure.

The semantics are now very confusing [1].

The default in x86-64 will be 10, so we'll have

	CONFIG_PAGE_BLOCK_ORDER=10

But then, we'll do this

	#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)

So the actual pageblock order will be different than
CONFIG_PAGE_BLOCK_ORDER. Confusing.

Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ?
CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed.

[1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order 2025-06-03 13:03 ` David Hildenbrand @ 2025-06-03 14:55 ` Zi Yan 2025-06-03 15:14 ` Zi Yan 2025-06-03 15:20 ` Juan Yescas 1 sibling, 1 reply; 8+ messages in thread From: Zi Yan @ 2025-06-03 14:55 UTC (permalink / raw) To: David Hildenbrand Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim On 3 Jun 2025, at 9:03, David Hildenbrand wrote: > On 21.05.25 23:57, Juan Yescas wrote: >> Problem: On large page size configurations (16KiB, 64KiB), the CMA >> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, >> and this causes the CMA reservations to be larger than necessary. >> This means that system will have less available MIGRATE_UNMOVABLE and >> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them. >> >> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on >> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of >> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels. 
>> >> For example, in ARM, the CMA alignment requirement when: >> >> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used >> - CONFIG_TRANSPARENT_HUGEPAGE is set: >> >> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >> ----------------------------------------------------------------------- >> 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB >> 16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB >> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >> >> There are some extreme cases for the CMA alignment requirement when: >> >> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set >> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set: >> - CONFIG_HUGETLB_PAGE is NOT set >> >> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >> ------------------------------------------------------------------------ >> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB >> 16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB >> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >> >> This affects the CMA reservations for the drivers. If a driver in a >> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal >> reservation has to be 32MiB due to the alignment requirements: >> >> reserved-memory { >> ... >> cma_test_reserve: cma_test_reserve { >> compatible = "shared-dma-pool"; >> size = <0x0 0x400000>; /* 4 MiB */ >> ... >> }; >> }; >> >> reserved-memory { >> ... >> cma_test_reserve: cma_test_reserve { >> compatible = "shared-dma-pool"; >> size = <0x0 0x2000000>; /* 32 MiB */ >> ... >> }; >> }; >> >> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that >> allows to set the page block order in all the architectures. >> The maximum page block order will be given by >> ARCH_FORCE_MAX_ORDER. >> >> By default, CONFIG_PAGE_BLOCK_ORDER will have the same >> value that ARCH_FORCE_MAX_ORDER. This will make sure that >> current kernel configurations won't be affected by this >> change. It is a opt-in change. 
>> >> This patch will allow to have the same CMA alignment >> requirements for large page sizes (16KiB, 64KiB) as that >> in 4kb kernels by setting a lower pageblock_order. >> >> Tests: >> >> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 >> on 4k and 16k kernels. >> >> - Verified that Transparent Huge Pages work when pageblock_order >> is 1, 7, 10 on 4k and 16k kernels. >> >> - Verified that dma-buf heaps allocations work when pageblock_order >> is 1, 7, 10 on 4k and 16k kernels. >> >> Benchmarks: >> >> The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The >> reason for the pageblock_order 7 is because this value makes the min >> CMA alignment requirement the same as that in 4kb kernels (2MB). >> >> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of >> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf >> (https://developer.android.com/ndk/guides/simpleperf) to measure >> the # of instructions and page-faults on 16k kernels. >> The benchmark was executed 10 times. The averages are below: >> >> # instructions | #page-faults >> order 10 | order 7 | order 10 | order 7 >> -------------------------------------------------------- >> 13,891,765,770 | 11,425,777,314 | 220 | 217 >> 14,456,293,487 | 12,660,819,302 | 224 | 219 >> 13,924,261,018 | 13,243,970,736 | 217 | 221 >> 13,910,886,504 | 13,845,519,630 | 217 | 221 >> 14,388,071,190 | 13,498,583,098 | 223 | 224 >> 13,656,442,167 | 12,915,831,681 | 216 | 218 >> 13,300,268,343 | 12,930,484,776 | 222 | 218 >> 13,625,470,223 | 14,234,092,777 | 219 | 218 >> 13,508,964,965 | 13,432,689,094 | 225 | 219 >> 13,368,950,667 | 13,683,587,37 | 219 | 225 >> ------------------------------------------------------------------- >> 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages >> >> There were 4.86% fewer #instructions when order was 7, in comparison >> with order 10. 
>> >> 13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%) >> >> The number of page faults in order 7 and 10 were the same. >> >> These results didn't show any significant regression when the >> pageblock_order is set to 7 on 16kb kernels. >> >> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times >> on the 16k kernels with pageblock_order 7 and 10. >> >> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) % >> ------------------------------------------------------------------- >> 15.8 | 16.4 | 0.6 | 3.80% >> 16.4 | 16.2 | -0.2 | -1.22% >> 16.6 | 16.3 | -0.3 | -1.81% >> 16.8 | 16.3 | -0.5 | -2.98% >> 16.6 | 16.8 | 0.2 | 1.20% >> ------------------------------------------------------------------- >> 16.44 16.4 -0.04 -0.24% Averages >> >> The results didn't show any significant regression when the >> pageblock_order is set to 7 on 16kb kernels. >> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Cc: David Hildenbrand <david@redhat.com> >> CC: Mike Rapoport <rppt@kernel.org> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Suren Baghdasaryan <surenb@google.com> >> Cc: Minchan Kim <minchan@kernel.org> >> Signed-off-by: Juan Yescas <jyescas@google.com> >> Acked-by: Zi Yan <ziy@nvidia.com> >> --- >> Changes in v7: >> - Update alignment calculation to 2MiB as per David's >> observation. >> - Update page block order calculation in mm/mm_init.c for >> powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set. >> >> Changes in v6: >> - Applied the change provided by Zi Yan to fix >> the Kconfig. The change consists in evaluating >> to true or false in the if expression for range: >> range 1 <symbol> if <expression to eval true/false>. >> >> Changes in v5: >> - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The >> ranges with config definitions don't work in Kconfig, >> for example (range 1 MY_CONFIG). 
>> - Add PAGE_BLOCK_ORDER_MANUAL config for the >> page block order number. The default value was not >> defined. >> - Fix typos reported by Andrew. >> - Test default configs in powerpc. >> >> Changes in v4: >> - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to >> validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at >> compile time. >> - This change fixes the warning in: >> https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/ >> >> Changes in v3: >> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER >> as per Matthew's suggestion. >> - Update comments in pageblock-flags.h for pageblock_order >> value when THP or HugeTLB are not used. >> >> Changes in v2: >> - Add Zi's Acked-by tag. >> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as >> per Zi and Matthew suggestion so it is available to >> all the architectures. >> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when >> ARCH_FORCE_MAX_ORDER is not available. >> >> include/linux/mmzone.h | 16 ++++++++++++++++ >> include/linux/pageblock-flags.h | 8 ++++---- >> mm/Kconfig | 34 +++++++++++++++++++++++++++++++++ >> mm/mm_init.c | 2 +- >> 4 files changed, 55 insertions(+), 5 deletions(-) >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 6ccec1bf2896..05610337bbb6 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -37,6 +37,22 @@ >> #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) >> +/* Defines the order for the number of pages that have a migrate type. 
*/ >> +#ifndef CONFIG_PAGE_BLOCK_ORDER >> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER >> +#else >> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER >> +#endif /* CONFIG_PAGE_BLOCK_ORDER */ >> + >> +/* >> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated >> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, >> + * which defines the order for the number of pages that can have a migrate type >> + */ >> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) >> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER >> +#endif >> + >> /* >> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed >> * costly to service. That is between allocation orders which should >> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h >> index fc6b9c87cb0a..e73a4292ef02 100644 >> --- a/include/linux/pageblock-flags.h >> +++ b/include/linux/pageblock-flags.h >> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; >> * Huge pages are a constant size, but don't exceed the maximum allocation >> * granularity. 
>> */ >> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER) >> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) >> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ >> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) >> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER) >> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) >> #else /* CONFIG_TRANSPARENT_HUGEPAGE */ >> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */ >> -#define pageblock_order MAX_PAGE_ORDER >> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ >> +#define pageblock_order PAGE_BLOCK_ORDER >> #endif /* CONFIG_HUGETLB_PAGE */ >> diff --git a/mm/Kconfig b/mm/Kconfig >> index e113f713b493..13a5c4f6e6b6 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -989,6 +989,40 @@ config CMA_AREAS >> If unsure, leave the default value "8" in UMA and "20" in NUMA. >> +# >> +# Select this config option from the architecture Kconfig, if available, to set >> +# the max page order for physically contiguous allocations. >> +# >> +config ARCH_FORCE_MAX_ORDER >> + int >> + >> +# >> +# When ARCH_FORCE_MAX_ORDER is not defined, >> +# the default page block order is MAX_PAGE_ORDER (10) as per >> +# include/linux/mmzone.h. >> +# >> +config PAGE_BLOCK_ORDER >> + int "Page Block Order" >> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0 >> + default 10 if ARCH_FORCE_MAX_ORDER = 0 >> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >> + help >> + The page block order refers to the power of two number of pages that >> + are physically contiguous and can have a migrate type associated to >> + them. The maximum size of the page block order is limited by >> + ARCH_FORCE_MAX_ORDER. 
>> + >> + This config allows overriding the default page block order when the >> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER >> + or MAX_PAGE_ORDER. >> + >> + Reducing pageblock order can negatively impact THP generation >> + success rate. If your workloads uses THP heavily, please use this >> + option with caution. >> + >> + Don't change if unsure. > > > The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have > > CONFIG_PAGE_BLOCK_ORDER=10 > > > But then, we'll do this > > #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) > > > So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER. > > Confusing. > > Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed. IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me. > > [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928 > -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order 2025-06-03 14:55 ` Zi Yan @ 2025-06-03 15:14 ` Zi Yan 2025-06-03 15:42 ` David Hildenbrand 0 siblings, 1 reply; 8+ messages in thread From: Zi Yan @ 2025-06-03 15:14 UTC (permalink / raw) To: David Hildenbrand Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim On 3 Jun 2025, at 10:55, Zi Yan wrote: > On 3 Jun 2025, at 9:03, David Hildenbrand wrote: > >> On 21.05.25 23:57, Juan Yescas wrote: >>> Problem: On large page size configurations (16KiB, 64KiB), the CMA >>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, >>> and this causes the CMA reservations to be larger than necessary. >>> This means that system will have less available MIGRATE_UNMOVABLE and >>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them. >>> >>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on >>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of >>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels. 
>>> >>> For example, in ARM, the CMA alignment requirement when: >>> >>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used >>> - CONFIG_TRANSPARENT_HUGEPAGE is set: >>> >>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >>> ----------------------------------------------------------------------- >>> 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB >>> 16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB >>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >>> >>> There are some extreme cases for the CMA alignment requirement when: >>> >>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set >>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set: >>> - CONFIG_HUGETLB_PAGE is NOT set >>> >>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >>> ------------------------------------------------------------------------ >>> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB >>> 16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB >>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >>> >>> This affects the CMA reservations for the drivers. If a driver in a >>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal >>> reservation has to be 32MiB due to the alignment requirements: >>> >>> reserved-memory { >>> ... >>> cma_test_reserve: cma_test_reserve { >>> compatible = "shared-dma-pool"; >>> size = <0x0 0x400000>; /* 4 MiB */ >>> ... >>> }; >>> }; >>> >>> reserved-memory { >>> ... >>> cma_test_reserve: cma_test_reserve { >>> compatible = "shared-dma-pool"; >>> size = <0x0 0x2000000>; /* 32 MiB */ >>> ... >>> }; >>> }; >>> >>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that >>> allows to set the page block order in all the architectures. >>> The maximum page block order will be given by >>> ARCH_FORCE_MAX_ORDER. >>> >>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same >>> value that ARCH_FORCE_MAX_ORDER. This will make sure that >>> current kernel configurations won't be affected by this >>> change. It is a opt-in change. 
>>> >>> This patch will allow to have the same CMA alignment >>> requirements for large page sizes (16KiB, 64KiB) as that >>> in 4kb kernels by setting a lower pageblock_order. >>> >>> Tests: >>> >>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 >>> on 4k and 16k kernels. >>> >>> - Verified that Transparent Huge Pages work when pageblock_order >>> is 1, 7, 10 on 4k and 16k kernels. >>> >>> - Verified that dma-buf heaps allocations work when pageblock_order >>> is 1, 7, 10 on 4k and 16k kernels. >>> >>> Benchmarks: >>> >>> The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The >>> reason for the pageblock_order 7 is because this value makes the min >>> CMA alignment requirement the same as that in 4kb kernels (2MB). >>> >>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of >>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf >>> (https://developer.android.com/ndk/guides/simpleperf) to measure >>> the # of instructions and page-faults on 16k kernels. >>> The benchmark was executed 10 times. The averages are below: >>> >>> # instructions | #page-faults >>> order 10 | order 7 | order 10 | order 7 >>> -------------------------------------------------------- >>> 13,891,765,770 | 11,425,777,314 | 220 | 217 >>> 14,456,293,487 | 12,660,819,302 | 224 | 219 >>> 13,924,261,018 | 13,243,970,736 | 217 | 221 >>> 13,910,886,504 | 13,845,519,630 | 217 | 221 >>> 14,388,071,190 | 13,498,583,098 | 223 | 224 >>> 13,656,442,167 | 12,915,831,681 | 216 | 218 >>> 13,300,268,343 | 12,930,484,776 | 222 | 218 >>> 13,625,470,223 | 14,234,092,777 | 219 | 218 >>> 13,508,964,965 | 13,432,689,094 | 225 | 219 >>> 13,368,950,667 | 13,683,587,37 | 219 | 225 >>> ------------------------------------------------------------------- >>> 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages >>> >>> There were 4.85% #instructions when order was 7, in comparison >>> with order 10. 
>>> >>> 13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%) >>> >>> The number of page faults in order 7 and 10 were the same. >>> >>> These results didn't show any significant regression when the >>> pageblock_order is set to 7 on 16kb kernels. >>> >>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times >>> on the 16k kernels with pageblock_order 7 and 10. >>> >>> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) % >>> ------------------------------------------------------------------- >>> 15.8 | 16.4 | 0.6 | 3.80% >>> 16.4 | 16.2 | -0.2 | -1.22% >>> 16.6 | 16.3 | -0.3 | -1.81% >>> 16.8 | 16.3 | -0.5 | -2.98% >>> 16.6 | 16.8 | 0.2 | 1.20% >>> ------------------------------------------------------------------- >>> 16.44 16.4 -0.04 -0.24% Averages >>> >>> The results didn't show any significant regression when the >>> pageblock_order is set to 7 on 16kb kernels. >>> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>> Cc: David Hildenbrand <david@redhat.com> >>> CC: Mike Rapoport <rppt@kernel.org> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Cc: Suren Baghdasaryan <surenb@google.com> >>> Cc: Minchan Kim <minchan@kernel.org> >>> Signed-off-by: Juan Yescas <jyescas@google.com> >>> Acked-by: Zi Yan <ziy@nvidia.com> >>> --- >>> Changes in v7: >>> - Update alignment calculation to 2MiB as per David's >>> observation. >>> - Update page block order calculation in mm/mm_init.c for >>> powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set. >>> >>> Changes in v6: >>> - Applied the change provided by Zi Yan to fix >>> the Kconfig. The change consists in evaluating >>> to true or false in the if expression for range: >>> range 1 <symbol> if <expression to eval true/false>. >>> >>> Changes in v5: >>> - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. 
The >>> ranges with config definitions don't work in Kconfig, >>> for example (range 1 MY_CONFIG). >>> - Add PAGE_BLOCK_ORDER_MANUAL config for the >>> page block order number. The default value was not >>> defined. >>> - Fix typos reported by Andrew. >>> - Test default configs in powerpc. >>> >>> Changes in v4: >>> - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to >>> validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at >>> compile time. >>> - This change fixes the warning in: >>> https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/ >>> >>> Changes in v3: >>> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER >>> as per Matthew's suggestion. >>> - Update comments in pageblock-flags.h for pageblock_order >>> value when THP or HugeTLB are not used. >>> >>> Changes in v2: >>> - Add Zi's Acked-by tag. >>> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as >>> per Zi and Matthew suggestion so it is available to >>> all the architectures. >>> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when >>> ARCH_FORCE_MAX_ORDER is not available. >>> >>> include/linux/mmzone.h | 16 ++++++++++++++++ >>> include/linux/pageblock-flags.h | 8 ++++---- >>> mm/Kconfig | 34 +++++++++++++++++++++++++++++++++ >>> mm/mm_init.c | 2 +- >>> 4 files changed, 55 insertions(+), 5 deletions(-) >>> >>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>> index 6ccec1bf2896..05610337bbb6 100644 >>> --- a/include/linux/mmzone.h >>> +++ b/include/linux/mmzone.h >>> @@ -37,6 +37,22 @@ >>> #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) >>> +/* Defines the order for the number of pages that have a migrate type. 
*/ >>> +#ifndef CONFIG_PAGE_BLOCK_ORDER >>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER >>> +#else >>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER >>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */ >>> + >>> +/* >>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated >>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, >>> + * which defines the order for the number of pages that can have a migrate type >>> + */ >>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) >>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER >>> +#endif >>> + >>> /* >>> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed >>> * costly to service. That is between allocation orders which should >>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h >>> index fc6b9c87cb0a..e73a4292ef02 100644 >>> --- a/include/linux/pageblock-flags.h >>> +++ b/include/linux/pageblock-flags.h >>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; >>> * Huge pages are a constant size, but don't exceed the maximum allocation >>> * granularity. 
>>> */ >>> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER) >>> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) >>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ >>> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) >>> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER) >>> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) >>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */ >>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */ >>> -#define pageblock_order MAX_PAGE_ORDER >>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ >>> +#define pageblock_order PAGE_BLOCK_ORDER >>> #endif /* CONFIG_HUGETLB_PAGE */ >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index e113f713b493..13a5c4f6e6b6 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -989,6 +989,40 @@ config CMA_AREAS >>> If unsure, leave the default value "8" in UMA and "20" in NUMA. >>> +# >>> +# Select this config option from the architecture Kconfig, if available, to set >>> +# the max page order for physically contiguous allocations. >>> +# >>> +config ARCH_FORCE_MAX_ORDER >>> + int >>> + >>> +# >>> +# When ARCH_FORCE_MAX_ORDER is not defined, >>> +# the default page block order is MAX_PAGE_ORDER (10) as per >>> +# include/linux/mmzone.h. >>> +# >>> +config PAGE_BLOCK_ORDER >>> + int "Page Block Order" >>> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0 >>> + default 10 if ARCH_FORCE_MAX_ORDER = 0 >>> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >>> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >>> + help >>> + The page block order refers to the power of two number of pages that >>> + are physically contiguous and can have a migrate type associated to >>> + them. The maximum size of the page block order is limited by >>> + ARCH_FORCE_MAX_ORDER. 
>>> + >>> + This config allows overriding the default page block order when the >>> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER >>> + or MAX_PAGE_ORDER. >>> + >>> + Reducing pageblock order can negatively impact THP generation >>> + success rate. If your workloads uses THP heavily, please use this >>> + option with caution. >>> + >>> + Don't change if unsure. >> >> >> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have >> >> CONFIG_PAGE_BLOCK_ORDER=10 >> >> >> But then, we'll do this >> >> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) >> >> >> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER. >> >> Confusing. >> >> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed. > > IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region > size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me. LIMIT might be still ambiguous, since it can be lower limit or upper limit. CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I come up with, if it looks good to you, I can send it out properly. From 7fff4fd87ed3aa160db8d2f0d9e5b219321df4f9 Mon Sep 17 00:00:00 2001 From: Zi Yan <ziy@nvidia.com> Date: Tue, 3 Jun 2025 11:09:37 -0400 Subject: [PATCH] mm: rename CONFIG_PAGE_BLOCK_ORDER to CONFIG_PAGE_BLOCK_ORDER_CEIL. The config is in fact an additional upper limit of pageblock_order, so rename it to avoid confusion. 
Signed-off-by: Zi Yan <ziy@nvidia.com> --- include/linux/mmzone.h | 14 +++++++------- include/linux/pageblock-flags.h | 8 ++++---- mm/Kconfig | 15 ++++++++------- 3 files changed, 19 insertions(+), 18 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 283913d42d7b..523b407e63e8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -38,19 +38,19 @@ #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) /* Defines the order for the number of pages that have a migrate type. */ -#ifndef CONFIG_PAGE_BLOCK_ORDER -#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER +#ifndef CONFIG_PAGE_BLOCK_ORDER_CEIL +#define PAGE_BLOCK_ORDER_CEIL MAX_PAGE_ORDER #else -#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER -#endif /* CONFIG_PAGE_BLOCK_ORDER */ +#define PAGE_BLOCK_ORDER_CEIL CONFIG_PAGE_BLOCK_ORDER_CEIL +#endif /* CONFIG_PAGE_BLOCK_ORDER_CEIL */ /* * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated - * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER_CEIL, * which defines the order for the number of pages that can have a migrate type */ -#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) -#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER +#if (PAGE_BLOCK_ORDER_CEIL > MAX_PAGE_ORDER) +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER_CEIL #endif /* diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index e73a4292ef02..e7a86cd238c2 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; * Huge pages are a constant size, but don't exceed the maximum allocation * granularity. 
*/ -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER_CEIL) #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER_CEIL) #else /* CONFIG_TRANSPARENT_HUGEPAGE */ -/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ -#define pageblock_order PAGE_BLOCK_ORDER +/* If huge pages are not used, group by PAGE_BLOCK_ORDER_CEIL */ +#define pageblock_order PAGE_BLOCK_ORDER_CEIL #endif /* CONFIG_HUGETLB_PAGE */ diff --git a/mm/Kconfig b/mm/Kconfig index eccb2e46ffcb..3b27e644bd1f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1017,8 +1017,8 @@ config ARCH_FORCE_MAX_ORDER # the default page block order is MAX_PAGE_ORDER (10) as per # include/linux/mmzone.h. # -config PAGE_BLOCK_ORDER - int "Page Block Order" +config PAGE_BLOCK_ORDER_CEIL + int "Page Block Order Upper Limit" range 1 10 if ARCH_FORCE_MAX_ORDER = 0 default 10 if ARCH_FORCE_MAX_ORDER = 0 range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 @@ -1026,12 +1026,13 @@ config PAGE_BLOCK_ORDER help The page block order refers to the power of two number of pages that are physically contiguous and can have a migrate type associated to - them. The maximum size of the page block order is limited by - ARCH_FORCE_MAX_ORDER. + them. The maximum size of the page block order is at least limited by + ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER. - This config allows overriding the default page block order when the - page block order is required to be smaller than ARCH_FORCE_MAX_ORDER - or MAX_PAGE_ORDER. 
+ This config adds a new upper limit of default page block + order when the page block order is required to be smaller than + ARCH_FORCE_MAX_ORDER/MAX_PAGE_ORDER or other limits + (see include/linux/pageblock-flags.h for details). Reducing pageblock order can negatively impact THP generation success rate. If your workloads uses THP heavily, please use this -- 2.47.2 Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order 2025-06-03 15:14 ` Zi Yan @ 2025-06-03 15:42 ` David Hildenbrand 0 siblings, 0 replies; 8+ messages in thread From: David Hildenbrand @ 2025-06-03 15:42 UTC (permalink / raw) To: Zi Yan Cc: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim On 03.06.25 17:14, Zi Yan wrote: > On 3 Jun 2025, at 10:55, Zi Yan wrote: > >> On 3 Jun 2025, at 9:03, David Hildenbrand wrote: >> >>> On 21.05.25 23:57, Juan Yescas wrote: >>>> Problem: On large page size configurations (16KiB, 64KiB), the CMA >>>> alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, >>>> and this causes the CMA reservations to be larger than necessary. >>>> This means that system will have less available MIGRATE_UNMOVABLE and >>>> MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them. >>>> >>>> The CMA_MIN_ALIGNMENT_BYTES increases because it depends on >>>> MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of >>>> ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels. 
>>>> >>>> For example, in ARM, the CMA alignment requirement when: >>>> >>>> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used >>>> - CONFIG_TRANSPARENT_HUGEPAGE is set: >>>> >>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >>>> ----------------------------------------------------------------------- >>>> 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB >>>> 16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB >>>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >>>> >>>> There are some extreme cases for the CMA alignment requirement when: >>>> >>>> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set >>>> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set: >>>> - CONFIG_HUGETLB_PAGE is NOT set >>>> >>>> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES >>>> ------------------------------------------------------------------------ >>>> 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB >>>> 16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB >>>> 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB >>>> >>>> This affects the CMA reservations for the drivers. If a driver in a >>>> 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal >>>> reservation has to be 32MiB due to the alignment requirements: >>>> >>>> reserved-memory { >>>> ... >>>> cma_test_reserve: cma_test_reserve { >>>> compatible = "shared-dma-pool"; >>>> size = <0x0 0x400000>; /* 4 MiB */ >>>> ... >>>> }; >>>> }; >>>> >>>> reserved-memory { >>>> ... >>>> cma_test_reserve: cma_test_reserve { >>>> compatible = "shared-dma-pool"; >>>> size = <0x0 0x2000000>; /* 32 MiB */ >>>> ... >>>> }; >>>> }; >>>> >>>> Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that >>>> allows to set the page block order in all the architectures. >>>> The maximum page block order will be given by >>>> ARCH_FORCE_MAX_ORDER. >>>> >>>> By default, CONFIG_PAGE_BLOCK_ORDER will have the same >>>> value that ARCH_FORCE_MAX_ORDER. This will make sure that >>>> current kernel configurations won't be affected by this >>>> change. 
It is an opt-in change. >>>> >>>> This patch will allow having the same CMA alignment >>>> requirements for large page sizes (16KiB, 64KiB) as that >>>> in 4kb kernels by setting a lower pageblock_order. >>>> >>>> Tests: >>>> >>>> - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 >>>> on 4k and 16k kernels. >>>> >>>> - Verified that Transparent Huge Pages work when pageblock_order >>>> is 1, 7, 10 on 4k and 16k kernels. >>>> >>>> - Verified that dma-buf heaps allocations work when pageblock_order >>>> is 1, 7, 10 on 4k and 16k kernels. >>>> >>>> Benchmarks: >>>> >>>> The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The >>>> reason for pageblock_order 7 is that this value makes the min >>>> CMA alignment requirement the same as that in 4kb kernels (2MB). >>>> >>>> - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of >>>> SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf >>>> (https://developer.android.com/ndk/guides/simpleperf) to measure >>>> the # of instructions and page-faults on 16k kernels. >>>> The benchmark was executed 10 times.
The averages are below: >>>> >>>> # instructions | #page-faults >>>> order 10 | order 7 | order 10 | order 7 >>>> -------------------------------------------------------- >>>> 13,891,765,770 | 11,425,777,314 | 220 | 217 >>>> 14,456,293,487 | 12,660,819,302 | 224 | 219 >>>> 13,924,261,018 | 13,243,970,736 | 217 | 221 >>>> 13,910,886,504 | 13,845,519,630 | 217 | 221 >>>> 14,388,071,190 | 13,498,583,098 | 223 | 224 >>>> 13,656,442,167 | 12,915,831,681 | 216 | 218 >>>> 13,300,268,343 | 12,930,484,776 | 222 | 218 >>>> 13,625,470,223 | 14,234,092,777 | 219 | 218 >>>> 13,508,964,965 | 13,432,689,094 | 225 | 219 >>>> 13,368,950,667 | 13,683,587,37 | 219 | 225 >>>> ------------------------------------------------------------------- >>>> 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages >>>> >>>> There were 4.85% #instructions when order was 7, in comparison >>>> with order 10. >>>> >>>> 13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%) >>>> >>>> The number of page faults in order 7 and 10 were the same. >>>> >>>> These results didn't show any significant regression when the >>>> pageblock_order is set to 7 on 16kb kernels. >>>> >>>> - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times >>>> on the 16k kernels with pageblock_order 7 and 10. >>>> >>>> order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) % >>>> ------------------------------------------------------------------- >>>> 15.8 | 16.4 | 0.6 | 3.80% >>>> 16.4 | 16.2 | -0.2 | -1.22% >>>> 16.6 | 16.3 | -0.3 | -1.81% >>>> 16.8 | 16.3 | -0.5 | -2.98% >>>> 16.6 | 16.8 | 0.2 | 1.20% >>>> ------------------------------------------------------------------- >>>> 16.44 16.4 -0.04 -0.24% Averages >>>> >>>> The results didn't show any significant regression when the >>>> pageblock_order is set to 7 on 16kb kernels. >>>> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Cc: Vlastimil Babka <vbabka@suse.cz> >>>> Cc: Liam R. 
Howlett <Liam.Howlett@oracle.com> >>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>>> Cc: David Hildenbrand <david@redhat.com> >>>> CC: Mike Rapoport <rppt@kernel.org> >>>> Cc: Zi Yan <ziy@nvidia.com> >>>> Cc: Suren Baghdasaryan <surenb@google.com> >>>> Cc: Minchan Kim <minchan@kernel.org> >>>> Signed-off-by: Juan Yescas <jyescas@google.com> >>>> Acked-by: Zi Yan <ziy@nvidia.com> >>>> --- >>>> Changes in v7: >>>> - Update alignment calculation to 2MiB as per David's >>>> observation. >>>> - Update page block order calculation in mm/mm_init.c for >>>> powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set. >>>> >>>> Changes in v6: >>>> - Applied the change provided by Zi Yan to fix >>>> the Kconfig. The change consists in evaluating >>>> to true or false in the if expression for range: >>>> range 1 <symbol> if <expression to eval true/false>. >>>> >>>> Changes in v5: >>>> - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. The >>>> ranges with config definitions don't work in Kconfig, >>>> for example (range 1 MY_CONFIG). >>>> - Add PAGE_BLOCK_ORDER_MANUAL config for the >>>> page block order number. The default value was not >>>> defined. >>>> - Fix typos reported by Andrew. >>>> - Test default configs in powerpc. >>>> >>>> Changes in v4: >>>> - Set PAGE_BLOCK_ORDER in incluxe/linux/mmzone.h to >>>> validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at >>>> compile time. >>>> - This change fixes the warning in: >>>> https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/ >>>> >>>> Changes in v3: >>>> - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER >>>> as per Matthew's suggestion. >>>> - Update comments in pageblock-flags.h for pageblock_order >>>> value when THP or HugeTLB are not used. >>>> >>>> Changes in v2: >>>> - Add Zi's Acked-by tag. >>>> - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as >>>> per Zi and Matthew suggestion so it is available to >>>> all the architectures. 
>>>> - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when >>>> ARCH_FORCE_MAX_ORDER is not available. >>>> >>>> include/linux/mmzone.h | 16 ++++++++++++++++ >>>> include/linux/pageblock-flags.h | 8 ++++---- >>>> mm/Kconfig | 34 +++++++++++++++++++++++++++++++++ >>>> mm/mm_init.c | 2 +- >>>> 4 files changed, 55 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>>> index 6ccec1bf2896..05610337bbb6 100644 >>>> --- a/include/linux/mmzone.h >>>> +++ b/include/linux/mmzone.h >>>> @@ -37,6 +37,22 @@ >>>> #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) >>>> +/* Defines the order for the number of pages that have a migrate type. */ >>>> +#ifndef CONFIG_PAGE_BLOCK_ORDER >>>> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER >>>> +#else >>>> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER >>>> +#endif /* CONFIG_PAGE_BLOCK_ORDER */ >>>> + >>>> +/* >>>> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated >>>> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, >>>> + * which defines the order for the number of pages that can have a migrate type >>>> + */ >>>> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) >>>> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER >>>> +#endif >>>> + >>>> /* >>>> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed >>>> * costly to service. That is between allocation orders which should >>>> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h >>>> index fc6b9c87cb0a..e73a4292ef02 100644 >>>> --- a/include/linux/pageblock-flags.h >>>> +++ b/include/linux/pageblock-flags.h >>>> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; >>>> * Huge pages are a constant size, but don't exceed the maximum allocation >>>> * granularity. 
>>>> */ >>>> -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER) >>>> +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) >>>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ >>>> #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) >>>> -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER) >>>> +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) >>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */ >>>> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */ >>>> -#define pageblock_order MAX_PAGE_ORDER >>>> +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ >>>> +#define pageblock_order PAGE_BLOCK_ORDER >>>> #endif /* CONFIG_HUGETLB_PAGE */ >>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>> index e113f713b493..13a5c4f6e6b6 100644 >>>> --- a/mm/Kconfig >>>> +++ b/mm/Kconfig >>>> @@ -989,6 +989,40 @@ config CMA_AREAS >>>> If unsure, leave the default value "8" in UMA and "20" in NUMA. >>>> +# >>>> +# Select this config option from the architecture Kconfig, if available, to set >>>> +# the max page order for physically contiguous allocations. >>>> +# >>>> +config ARCH_FORCE_MAX_ORDER >>>> + int >>>> + >>>> +# >>>> +# When ARCH_FORCE_MAX_ORDER is not defined, >>>> +# the default page block order is MAX_PAGE_ORDER (10) as per >>>> +# include/linux/mmzone.h. >>>> +# >>>> +config PAGE_BLOCK_ORDER >>>> + int "Page Block Order" >>>> + range 1 10 if ARCH_FORCE_MAX_ORDER = 0 >>>> + default 10 if ARCH_FORCE_MAX_ORDER = 0 >>>> + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >>>> + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 >>>> + help >>>> + The page block order refers to the power of two number of pages that >>>> + are physically contiguous and can have a migrate type associated to >>>> + them. The maximum size of the page block order is limited by >>>> + ARCH_FORCE_MAX_ORDER. 
>>>> + >>>> + This config allows overriding the default page block order when the >>>> + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER >>>> + or MAX_PAGE_ORDER. >>>> + >>>> + Reducing pageblock order can negatively impact THP generation >>>> + success rate. If your workloads uses THP heavily, please use this >>>> + option with caution. >>>> + >>>> + Don't change if unsure. >>> >>> >>> The semantics are now very confusing [1]. The default in x86-64 will be 10, so we'll have >>> >>> CONFIG_PAGE_BLOCK_ORDER=10 >>> >>> >>> But then, we'll do this >>> >>> #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) >>> >>> >>> So the actual pageblock order will be different than CONFIG_PAGE_BLOCK_ORDER. >>> >>> Confusing. >>> >>> Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed. >> >> IIRC, Juan's intention is to limit/lower pageblock order to reduce CMA region >> size. CONFIG_PAGE_BLOCK_ORDER_LIMIT sounds reasonable to me. > > LIMIT might be still ambiguous, since it can be lower limit or upper limit. > CONFIG_PAGE_BLOCK_ORDER_CEIL is better. Here is the patch I come up with, > if it looks good to you, I can send it out properly. LGTM -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order 2025-06-03 13:03 ` David Hildenbrand 2025-06-03 14:55 ` Zi Yan @ 2025-06-03 15:20 ` Juan Yescas 1 sibling, 0 replies; 8+ messages in thread From: Juan Yescas @ 2025-06-03 15:20 UTC (permalink / raw) To: David Hildenbrand Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan, linux-mm, linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim On Tue, Jun 3, 2025 at 6:03 AM David Hildenbrand <david@redhat.com> wrote: > > On 21.05.25 23:57, Juan Yescas wrote: > > Problem: On large page size configurations (16KiB, 64KiB), the CMA > > alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably, > > and this causes the CMA reservations to be larger than necessary. > > This means that system will have less available MIGRATE_UNMOVABLE and > > MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them. > > > > The CMA_MIN_ALIGNMENT_BYTES increases because it depends on > > MAX_PAGE_ORDER which depends on ARCH_FORCE_MAX_ORDER. The value of > > ARCH_FORCE_MAX_ORDER increases on 16k and 64k kernels. 
> > > > For example, in ARM, the CMA alignment requirement when: > > > > - CONFIG_ARCH_FORCE_MAX_ORDER default value is used > > - CONFIG_TRANSPARENT_HUGEPAGE is set: > > > > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES > > ----------------------------------------------------------------------- > > 4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB > > 16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB > > 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB > > > > There are some extreme cases for the CMA alignment requirement when: > > > > - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set > > - CONFIG_TRANSPARENT_HUGEPAGE is NOT set: > > - CONFIG_HUGETLB_PAGE is NOT set > > > > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES > > ------------------------------------------------------------------------ > > 4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB > > 16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB > > 64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB > > > > This affects the CMA reservations for the drivers. If a driver in a > > 4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal > > reservation has to be 32MiB due to the alignment requirements: > > > > reserved-memory { > > ... > > cma_test_reserve: cma_test_reserve { > > compatible = "shared-dma-pool"; > > size = <0x0 0x400000>; /* 4 MiB */ > > ... > > }; > > }; > > > > reserved-memory { > > ... > > cma_test_reserve: cma_test_reserve { > > compatible = "shared-dma-pool"; > > size = <0x0 0x2000000>; /* 32 MiB */ > > ... > > }; > > }; > > > > Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that > > allows to set the page block order in all the architectures. > > The maximum page block order will be given by > > ARCH_FORCE_MAX_ORDER. > > > > By default, CONFIG_PAGE_BLOCK_ORDER will have the same > > value that ARCH_FORCE_MAX_ORDER. This will make sure that > > current kernel configurations won't be affected by this > > change. It is a opt-in change. 
> > > > This patch will allow to have the same CMA alignment > > requirements for large page sizes (16KiB, 64KiB) as that > > in 4kb kernels by setting a lower pageblock_order. > > > > Tests: > > > > - Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 > > on 4k and 16k kernels. > > > > - Verified that Transparent Huge Pages work when pageblock_order > > is 1, 7, 10 on 4k and 16k kernels. > > > > - Verified that dma-buf heaps allocations work when pageblock_order > > is 1, 7, 10 on 4k and 16k kernels. > > > > Benchmarks: > > > > The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The > > reason for the pageblock_order 7 is because this value makes the min > > CMA alignment requirement the same as that in 4kb kernels (2MB). > > > > - Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of > > SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf > > (https://developer.android.com/ndk/guides/simpleperf) to measure > > the # of instructions and page-faults on 16k kernels. > > The benchmark was executed 10 times. The averages are below: > > > > # instructions | #page-faults > > order 10 | order 7 | order 10 | order 7 > > -------------------------------------------------------- > > 13,891,765,770 | 11,425,777,314 | 220 | 217 > > 14,456,293,487 | 12,660,819,302 | 224 | 219 > > 13,924,261,018 | 13,243,970,736 | 217 | 221 > > 13,910,886,504 | 13,845,519,630 | 217 | 221 > > 14,388,071,190 | 13,498,583,098 | 223 | 224 > > 13,656,442,167 | 12,915,831,681 | 216 | 218 > > 13,300,268,343 | 12,930,484,776 | 222 | 218 > > 13,625,470,223 | 14,234,092,777 | 219 | 218 > > 13,508,964,965 | 13,432,689,094 | 225 | 219 > > 13,368,950,667 | 13,683,587,37 | 219 | 225 > > ------------------------------------------------------------------- > > 13,803,137,433 | 13,131,974,268 | 220 | 220 Averages > > > > There were 4.85% #instructions when order was 7, in comparison > > with order 10. 
> > > > 13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%) > > > > The number of page faults in order 7 and 10 were the same. > > > > These results didn't show any significant regression when the > > pageblock_order is set to 7 on 16kb kernels. > > > > - Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times > > on the 16k kernels with pageblock_order 7 and 10. > > > > order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) % > > ------------------------------------------------------------------- > > 15.8 | 16.4 | 0.6 | 3.80% > > 16.4 | 16.2 | -0.2 | -1.22% > > 16.6 | 16.3 | -0.3 | -1.81% > > 16.8 | 16.3 | -0.5 | -2.98% > > 16.6 | 16.8 | 0.2 | 1.20% > > ------------------------------------------------------------------- > > 16.44 16.4 -0.04 -0.24% Averages > > > > The results didn't show any significant regression when the > > pageblock_order is set to 7 on 16kb kernels. > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Cc: David Hildenbrand <david@redhat.com> > > CC: Mike Rapoport <rppt@kernel.org> > > Cc: Zi Yan <ziy@nvidia.com> > > Cc: Suren Baghdasaryan <surenb@google.com> > > Cc: Minchan Kim <minchan@kernel.org> > > Signed-off-by: Juan Yescas <jyescas@google.com> > > Acked-by: Zi Yan <ziy@nvidia.com> > > --- > > Changes in v7: > > - Update alignment calculation to 2MiB as per David's > > observation. > > - Update page block order calculation in mm/mm_init.c for > > powerpc when CONFIG_HUGETLB_PAGE_SIZE_VARIABLE is set. > > > > Changes in v6: > > - Applied the change provided by Zi Yan to fix > > the Kconfig. The change consists in evaluating > > to true or false in the if expression for range: > > range 1 <symbol> if <expression to eval true/false>. > > > > Changes in v5: > > - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. 
The > > ranges with config definitions don't work in Kconfig, > > for example (range 1 MY_CONFIG). > > - Add PAGE_BLOCK_ORDER_MANUAL config for the > > page block order number. The default value was not > > defined. > > - Fix typos reported by Andrew. > > - Test default configs in powerpc. > > > > Changes in v4: > > - Set PAGE_BLOCK_ORDER in incluxe/linux/mmzone.h to > > validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at > > compile time. > > - This change fixes the warning in: > > https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/ > > > > Changes in v3: > > - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER > > as per Matthew's suggestion. > > - Update comments in pageblock-flags.h for pageblock_order > > value when THP or HugeTLB are not used. > > > > Changes in v2: > > - Add Zi's Acked-by tag. > > - Move ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as > > per Zi and Matthew suggestion so it is available to > > all the architectures. > > - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when > > ARCH_FORCE_MAX_ORDER is not available. > > > > include/linux/mmzone.h | 16 ++++++++++++++++ > > include/linux/pageblock-flags.h | 8 ++++---- > > mm/Kconfig | 34 +++++++++++++++++++++++++++++++++ > > mm/mm_init.c | 2 +- > > 4 files changed, 55 insertions(+), 5 deletions(-) > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 6ccec1bf2896..05610337bbb6 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -37,6 +37,22 @@ > > > > #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) > > > > +/* Defines the order for the number of pages that have a migrate type. 
*/ > > +#ifndef CONFIG_PAGE_BLOCK_ORDER > > +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER > > +#else > > +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER > > +#endif /* CONFIG_PAGE_BLOCK_ORDER */ > > + > > +/* > > + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated > > + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER, > > + * which defines the order for the number of pages that can have a migrate type > > + */ > > +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER) > > +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER > > +#endif > > + > > /* > > * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed > > * costly to service. That is between allocation orders which should > > diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h > > index fc6b9c87cb0a..e73a4292ef02 100644 > > --- a/include/linux/pageblock-flags.h > > +++ b/include/linux/pageblock-flags.h > > @@ -41,18 +41,18 @@ extern unsigned int pageblock_order; > > * Huge pages are a constant size, but don't exceed the maximum allocation > > * granularity. 
> > */ > > -#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER) > > +#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER) > > > > #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ > > > > #elif defined(CONFIG_TRANSPARENT_HUGEPAGE) > > > > -#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER) > > +#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) > > > > #else /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */ > > -#define pageblock_order MAX_PAGE_ORDER > > +/* If huge pages are not used, group by PAGE_BLOCK_ORDER */ > > +#define pageblock_order PAGE_BLOCK_ORDER > > > > #endif /* CONFIG_HUGETLB_PAGE */ > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > index e113f713b493..13a5c4f6e6b6 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -989,6 +989,40 @@ config CMA_AREAS > > > > If unsure, leave the default value "8" in UMA and "20" in NUMA. > > > > +# > > +# Select this config option from the architecture Kconfig, if available, to set > > +# the max page order for physically contiguous allocations. > > +# > > +config ARCH_FORCE_MAX_ORDER > > + int > > + > > +# > > +# When ARCH_FORCE_MAX_ORDER is not defined, > > +# the default page block order is MAX_PAGE_ORDER (10) as per > > +# include/linux/mmzone.h. > > +# > > +config PAGE_BLOCK_ORDER > > + int "Page Block Order" > > + range 1 10 if ARCH_FORCE_MAX_ORDER = 0 > > + default 10 if ARCH_FORCE_MAX_ORDER = 0 > > + range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 > > + default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0 > > + help > > + The page block order refers to the power of two number of pages that > > + are physically contiguous and can have a migrate type associated to > > + them. The maximum size of the page block order is limited by > > + ARCH_FORCE_MAX_ORDER. 
> > + > > + This config allows overriding the default page block order when the > > + page block order is required to be smaller than ARCH_FORCE_MAX_ORDER > > + or MAX_PAGE_ORDER. > > + > > + Reducing pageblock order can negatively impact THP generation > > + success rate. If your workloads uses THP heavily, please use this > > + option with caution. > > + > > + Don't change if unsure. > > > The semantics are now very confusing [1]. The default in x86-64 will be > 10, so we'll have > > CONFIG_PAGE_BLOCK_ORDER=10 > > > But then, we'll do this > > #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, > PAGE_BLOCK_ORDER) > > > So the actual pageblock order will be different than > CONFIG_PAGE_BLOCK_ORDER. > > Confusing. I agree that it becomes confusing due that pageblock_order value depends on whether THP, HugeTLB are set or not. > > Either CONFIG_PAGE_BLOCK_ORDER is misnamed (CONFIG_PAGE_BLOCK_ORDER_CEIL > ? CONFIG_PAGE_BLOCK_ORDER_LIMIT ?), or the semantics should be changed. > We could rename the configuration to CONFIG_PAGE_BLOCK_ORDER_CEIL. > [1] https://gitlab.com/cki-project/kernel-ark/-/merge_requests/3928 > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-06-03 15:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21 21:57 [PATCH v7] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
2025-05-28  8:21 ` Vlastimil Babka
2025-05-28 18:24   ` Andrew Morton
2025-06-03 13:03 ` David Hildenbrand
2025-06-03 14:55   ` Zi Yan
2025-06-03 15:14     ` Zi Yan
2025-06-03 15:42       ` David Hildenbrand
2025-06-03 15:20   ` Juan Yescas