* [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
@ 2025-11-28 3:11 ` Hongru Zhang
2025-11-29 0:34 ` Barry Song
2025-11-28 3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count Hongru Zhang
` (3 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-11-28 3:11 UTC (permalink / raw)
To: akpm, vbabka, david
Cc: linux-mm, linux-kernel, surenb, mhocko, jackmanb, hannes, ziy,
lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
weixugc, Hongru Zhang
From: Hongru Zhang <zhanghongru@xiaomi.com>
On mobile devices, some user-space memory management components check
memory pressure and fragmentation status periodically or via PSI, and
take actions such as killing processes or performing memory compaction
based on this information.
Under high load scenarios, reading /proc/pagetypeinfo causes memory
management components or memory allocation/free paths to be blocked
for extended periods waiting for the zone lock, leading to the
following issues:
1. Long interrupt-disabled spinlock sections - occasionally exceeding
10 ms on Qcom 8750 platforms - reducing system real-time performance
2. Memory management components being blocked for extended periods,
preventing rapid acquisition of memory fragmentation information for
critical memory management decisions and actions
3. Increased latency in memory allocation and free paths due to prolonged
zone lock contention
This patch adds per-migratetype counts to the buddy allocator in
preparation for optimizing /proc/pagetypeinfo access.
The optimized implementation:
- Makes per-migratetype count updates protected by the zone lock on the
write side while /proc/pagetypeinfo reads are lock-free, which reduces
interrupt-disabled spinlock duration and improves system real-time
performance (addressing issue #1)
- Reduces blocking time for memory management components when reading
/proc/pagetypeinfo, enabling more rapid acquisition of memory
fragmentation information (addressing issue #2)
- Minimizes the critical section held during /proc/pagetypeinfo reads to
reduce zone lock contention on memory allocation and free paths
(addressing issue #3)
The main overhead is a slight increase in latency on the memory
allocation and free paths due to additional per-migratetype counting,
with theoretically minimal impact on overall performance.
Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
---
include/linux/mmzone.h | 1 +
mm/mm_init.c | 1 +
mm/page_alloc.c | 7 ++++++-
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..6eeefe6a3727 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ extern int page_group_by_mobility_disabled;
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
+ unsigned long mt_nr_free[MIGRATE_TYPES];
};
struct pglist_data;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..dca2be8cc3b1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1439,6 +1439,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
for_each_migratetype_order(order, t) {
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
zone->free_area[order].nr_free = 0;
+ zone->free_area[order].mt_nr_free[t] = 0;
}
#ifdef CONFIG_UNACCEPTED_MEMORY
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..9431073e7255 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -818,6 +818,7 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
else
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
+ area->mt_nr_free[migratetype]++;
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
@@ -840,6 +841,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+ area->mt_nr_free[old_mt]--;
+ area->mt_nr_free[new_mt]++;
account_freepages(zone, -nr_pages, old_mt);
account_freepages(zone, nr_pages, new_mt);
@@ -855,6 +858,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
+ struct free_area *area = &zone->free_area[order];
int nr_pages = 1 << order;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
@@ -868,7 +872,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
list_del(&page->buddy_list);
__ClearPageBuddy(page);
set_page_private(page, 0);
- zone->free_area[order].nr_free--;
+ area->nr_free--;
+ area->mt_nr_free[migratetype]--;
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
--
2.43.0
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator
2025-11-28 3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
@ 2025-11-29 0:34 ` Barry Song
0 siblings, 0 replies; 20+ messages in thread
From: Barry Song @ 2025-11-29 0:34 UTC (permalink / raw)
To: Hongru Zhang
Cc: akpm, vbabka, david, linux-mm, linux-kernel, surenb, mhocko,
jackmanb, hannes, ziy, lorenzo.stoakes, Liam.Howlett, rppt,
axelrasmussen, yuanchu, weixugc, Hongru Zhang
On Fri, Nov 28, 2025 at 11:12 AM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed82ee55e66a..9431073e7255 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -818,6 +818,7 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
> else
> list_add(&page->buddy_list, &area->free_list[migratetype]);
> area->nr_free++;
> + area->mt_nr_free[migratetype]++;
>
> if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
> @@ -840,6 +841,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> get_pageblock_migratetype(page), old_mt, nr_pages);
>
> list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
> + area->mt_nr_free[old_mt]--;
> + area->mt_nr_free[new_mt]++;
The overhead comes from effectively counting twice. Have we checked whether
the readers of area->nr_free are on a hot path? If not, we might just drop
nr_free and compute the sum each time.
Buddyinfo and compaction do not seem to be on a hot path?
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28 3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
@ 2025-11-28 3:12 ` Hongru Zhang
2025-11-28 12:03 ` zhongjinji
2025-11-28 3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-11-28 3:12 UTC (permalink / raw)
To: akpm, vbabka, david
Cc: linux-mm, linux-kernel, surenb, mhocko, jackmanb, hannes, ziy,
lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
weixugc, Hongru Zhang
From: Hongru Zhang <zhanghongru@xiaomi.com>
This patch optimizes /proc/pagetypeinfo access by utilizing the
per-migratetype free page block counts already maintained by the buddy
allocator, instead of iterating through free lists under zone lock.
Accuracy. Both implementations have accuracy limitations. The previous
implementation required acquiring and releasing the zone lock for counting
each order and migratetype, making it potentially inaccurate. Under high
memory pressure, accuracy would further degrade due to zone lock
contention or fragmentation. The new implementation collects data within a
short time window, which helps maintain relatively small errors, and is
unaffected by memory pressure. Furthermore, user-space memory management
components inherently experience decision latency - by the time they
process the collected data and execute actions, the memory state has
already changed. This means that even perfectly accurate data at
collection time becomes stale by decision time. Considering these factors,
the accuracy trade-off introduced by the new implementation should be
acceptable for practical use cases, offering a balance between performance
and accuracy requirements.
Performance benefits:
System setup:
- 12th Gen Intel(R) Core(TM) i7-12700
- 1 NUMA node, 16G memory in total
- Turbo disabled
- cpufreq governor set to performance
1. Average latency over 10,000 /proc/pagetypeinfo accesses
+-----------------------+----------+------------+
| | no-patch | with-patch |
+-----------------------+----------+------------+
| Just after boot | 700.9 us | 268.6 us |
+-----------------------+----------+------------+
| After building kernel | 28.7 ms | 269.8 us |
+-----------------------+----------+------------+
2. Page alloc/free latency with concurrent /proc/pagetypeinfo access
Test setup:
- Using config-pagealloc-micro
- Monitor set to proc-pagetypeinfo, update frequency set to 10ms
- PAGEALLOC_ORDER_MIN=4, PAGEALLOC_ORDER_MAX=4
Without patch test results:
vanilla vanilla
no-monitor monitor
Min alloc-odr4-1 8539.00 ( 0.00%) 8762.00 ( -2.61%)
Min alloc-odr4-2 6501.00 ( 0.00%) 6683.00 ( -2.80%)
Min alloc-odr4-4 5537.00 ( 0.00%) 5873.00 ( -6.07%)
Min alloc-odr4-8 5030.00 ( 0.00%) 5361.00 ( -6.58%)
Min alloc-odr4-16 4782.00 ( 0.00%) 5162.00 ( -7.95%)
Min alloc-odr4-32 5838.00 ( 0.00%) 6499.00 ( -11.32%)
Min alloc-odr4-64 6565.00 ( 0.00%) 7413.00 ( -12.92%)
Min alloc-odr4-128 6896.00 ( 0.00%) 7898.00 ( -14.53%)
Min alloc-odr4-256 7303.00 ( 0.00%) 8163.00 ( -11.78%)
Min alloc-odr4-512 10179.00 ( 0.00%) 11985.00 ( -17.74%)
Min alloc-odr4-1024 11000.00 ( 0.00%) 12165.00 ( -10.59%)
Min free-odr4-1 820.00 ( 0.00%) 1230.00 ( -50.00%)
Min free-odr4-2 511.00 ( 0.00%) 952.00 ( -86.30%)
Min free-odr4-4 347.00 ( 0.00%) 434.00 ( -25.07%)
Min free-odr4-8 286.00 ( 0.00%) 399.00 ( -39.51%)
Min free-odr4-16 250.00 ( 0.00%) 405.00 ( -62.00%)
Min free-odr4-32 294.00 ( 0.00%) 405.00 ( -37.76%)
Min free-odr4-64 333.00 ( 0.00%) 363.00 ( -9.01%)
Min free-odr4-128 340.00 ( 0.00%) 412.00 ( -21.18%)
Min free-odr4-256 339.00 ( 0.00%) 329.00 ( 2.95%)
Min free-odr4-512 361.00 ( 0.00%) 409.00 ( -13.30%)
Min free-odr4-1024 300.00 ( 0.00%) 361.00 ( -20.33%)
Stddev alloc-odr4-1 7.29 ( 0.00%) 90.78 (-1146.00%)
Stddev alloc-odr4-2 3.87 ( 0.00%) 51.30 (-1225.75%)
Stddev alloc-odr4-4 3.20 ( 0.00%) 50.90 (-1491.24%)
Stddev alloc-odr4-8 4.67 ( 0.00%) 52.23 (-1019.35%)
Stddev alloc-odr4-16 5.72 ( 0.00%) 27.53 (-381.04%)
Stddev alloc-odr4-32 6.25 ( 0.00%) 641.23 (-10154.46%)
Stddev alloc-odr4-64 2.06 ( 0.00%) 386.99 (-18714.22%)
Stddev alloc-odr4-128 14.36 ( 0.00%) 52.39 (-264.77%)
Stddev alloc-odr4-256 32.42 ( 0.00%) 326.19 (-906.05%)
Stddev alloc-odr4-512 65.58 ( 0.00%) 184.49 (-181.31%)
Stddev alloc-odr4-1024 8.88 ( 0.00%) 153.01 (-1622.67%)
Stddev free-odr4-1 2.29 ( 0.00%) 152.27 (-6549.85%)
Stddev free-odr4-2 10.99 ( 0.00%) 73.10 (-564.89%)
Stddev free-odr4-4 1.99 ( 0.00%) 28.40 (-1324.45%)
Stddev free-odr4-8 2.51 ( 0.00%) 52.93 (-2007.64%)
Stddev free-odr4-16 2.85 ( 0.00%) 26.04 (-814.88%)
Stddev free-odr4-32 4.04 ( 0.00%) 27.05 (-569.79%)
Stddev free-odr4-64 2.10 ( 0.00%) 48.07 (-2185.66%)
Stddev free-odr4-128 2.63 ( 0.00%) 26.23 (-897.86%)
Stddev free-odr4-256 6.29 ( 0.00%) 37.04 (-488.71%)
Stddev free-odr4-512 2.56 ( 0.00%) 10.65 (-315.28%)
Stddev free-odr4-1024 0.95 ( 0.00%) 6.46 (-582.22%)
Max alloc-odr4-1 8564.00 ( 0.00%) 9099.00 ( -6.25%)
Max alloc-odr4-2 6511.00 ( 0.00%) 6844.00 ( -5.11%)
Max alloc-odr4-4 5549.00 ( 0.00%) 6038.00 ( -8.81%)
Max alloc-odr4-8 5045.00 ( 0.00%) 5551.00 ( -10.03%)
Max alloc-odr4-16 4800.00 ( 0.00%) 5257.00 ( -9.52%)
Max alloc-odr4-32 5861.00 ( 0.00%) 8115.00 ( -38.46%)
Max alloc-odr4-64 6571.00 ( 0.00%) 8292.00 ( -26.19%)
Max alloc-odr4-128 6930.00 ( 0.00%) 8081.00 ( -16.61%)
Max alloc-odr4-256 7372.00 ( 0.00%) 9150.00 ( -24.12%)
Max alloc-odr4-512 10333.00 ( 0.00%) 12636.00 ( -22.29%)
Max alloc-odr4-1024 11035.00 ( 0.00%) 12590.00 ( -14.09%)
Max free-odr4-1 828.00 ( 0.00%) 1724.00 (-108.21%)
Max free-odr4-2 543.00 ( 0.00%) 1192.00 (-119.52%)
Max free-odr4-4 354.00 ( 0.00%) 519.00 ( -46.61%)
Max free-odr4-8 293.00 ( 0.00%) 617.00 (-110.58%)
Max free-odr4-16 260.00 ( 0.00%) 483.00 ( -85.77%)
Max free-odr4-32 308.00 ( 0.00%) 488.00 ( -58.44%)
Max free-odr4-64 341.00 ( 0.00%) 505.00 ( -48.09%)
Max free-odr4-128 346.00 ( 0.00%) 497.00 ( -43.64%)
Max free-odr4-256 353.00 ( 0.00%) 463.00 ( -31.16%)
Max free-odr4-512 367.00 ( 0.00%) 442.00 ( -20.44%)
Max free-odr4-1024 303.00 ( 0.00%) 381.00 ( -25.74%)
With patch test results:
patched patched
no-monitor monitor
Min alloc-odr4-1 8488.00 ( 0.00%) 8514.00 ( -0.31%)
Min alloc-odr4-2 6551.00 ( 0.00%) 6527.00 ( 0.37%)
Min alloc-odr4-4 5536.00 ( 0.00%) 5591.00 ( -0.99%)
Min alloc-odr4-8 5008.00 ( 0.00%) 5098.00 ( -1.80%)
Min alloc-odr4-16 4760.00 ( 0.00%) 4857.00 ( -2.04%)
Min alloc-odr4-32 5827.00 ( 0.00%) 5919.00 ( -1.58%)
Min alloc-odr4-64 6561.00 ( 0.00%) 6680.00 ( -1.81%)
Min alloc-odr4-128 6898.00 ( 0.00%) 7014.00 ( -1.68%)
Min alloc-odr4-256 7311.00 ( 0.00%) 7464.00 ( -2.09%)
Min alloc-odr4-512 10181.00 ( 0.00%) 10286.00 ( -1.03%)
Min alloc-odr4-1024 11205.00 ( 0.00%) 11725.00 ( -4.64%)
Min free-odr4-1 789.00 ( 0.00%) 867.00 ( -9.89%)
Min free-odr4-2 490.00 ( 0.00%) 526.00 ( -7.35%)
Min free-odr4-4 350.00 ( 0.00%) 360.00 ( -2.86%)
Min free-odr4-8 272.00 ( 0.00%) 287.00 ( -5.51%)
Min free-odr4-16 247.00 ( 0.00%) 254.00 ( -2.83%)
Min free-odr4-32 298.00 ( 0.00%) 304.00 ( -2.01%)
Min free-odr4-64 334.00 ( 0.00%) 325.00 ( 2.69%)
Min free-odr4-128 334.00 ( 0.00%) 329.00 ( 1.50%)
Min free-odr4-256 336.00 ( 0.00%) 336.00 ( 0.00%)
Min free-odr4-512 360.00 ( 0.00%) 342.00 ( 5.00%)
Min free-odr4-1024 327.00 ( 0.00%) 355.00 ( -8.56%)
Stddev alloc-odr4-1 5.19 ( 0.00%) 45.38 (-775.09%)
Stddev alloc-odr4-2 6.99 ( 0.00%) 37.63 (-437.98%)
Stddev alloc-odr4-4 3.91 ( 0.00%) 17.85 (-356.28%)
Stddev alloc-odr4-8 5.15 ( 0.00%) 9.34 ( -81.47%)
Stddev alloc-odr4-16 3.83 ( 0.00%) 5.34 ( -39.34%)
Stddev alloc-odr4-32 1.96 ( 0.00%) 10.28 (-425.09%)
Stddev alloc-odr4-64 1.32 ( 0.00%) 333.30 (-25141.39%)
Stddev alloc-odr4-128 2.06 ( 0.00%) 7.37 (-258.28%)
Stddev alloc-odr4-256 15.56 ( 0.00%) 113.48 (-629.25%)
Stddev alloc-odr4-512 61.25 ( 0.00%) 165.09 (-169.53%)
Stddev alloc-odr4-1024 18.89 ( 0.00%) 2.93 ( 84.51%)
Stddev free-odr4-1 4.45 ( 0.00%) 40.12 (-800.98%)
Stddev free-odr4-2 1.50 ( 0.00%) 29.30 (-1850.31%)
Stddev free-odr4-4 1.27 ( 0.00%) 19.49 (-1439.40%)
Stddev free-odr4-8 0.97 ( 0.00%) 8.93 (-823.07%)
Stddev free-odr4-16 8.38 ( 0.00%) 4.51 ( 46.21%)
Stddev free-odr4-32 3.18 ( 0.00%) 6.59 (-107.42%)
Stddev free-odr4-64 2.40 ( 0.00%) 3.09 ( -28.50%)
Stddev free-odr4-128 1.55 ( 0.00%) 2.53 ( -62.92%)
Stddev free-odr4-256 0.41 ( 0.00%) 2.80 (-585.57%)
Stddev free-odr4-512 1.60 ( 0.00%) 4.84 (-202.08%)
Stddev free-odr4-1024 0.66 ( 0.00%) 1.19 ( -80.68%)
Max alloc-odr4-1 8505.00 ( 0.00%) 8676.00 ( -2.01%)
Max alloc-odr4-2 6572.00 ( 0.00%) 6651.00 ( -1.20%)
Max alloc-odr4-4 5552.00 ( 0.00%) 5646.00 ( -1.69%)
Max alloc-odr4-8 5024.00 ( 0.00%) 5131.00 ( -2.13%)
Max alloc-odr4-16 4774.00 ( 0.00%) 4875.00 ( -2.12%)
Max alloc-odr4-32 5834.00 ( 0.00%) 5950.00 ( -1.99%)
Max alloc-odr4-64 6565.00 ( 0.00%) 7434.00 ( -13.24%)
Max alloc-odr4-128 6907.00 ( 0.00%) 7034.00 ( -1.84%)
Max alloc-odr4-256 7347.00 ( 0.00%) 7843.00 ( -6.75%)
Max alloc-odr4-512 10315.00 ( 0.00%) 10866.00 ( -5.34%)
Max alloc-odr4-1024 11278.00 ( 0.00%) 11733.00 ( -4.03%)
Max free-odr4-1 803.00 ( 0.00%) 1009.00 ( -25.65%)
Max free-odr4-2 495.00 ( 0.00%) 607.00 ( -22.63%)
Max free-odr4-4 354.00 ( 0.00%) 417.00 ( -17.80%)
Max free-odr4-8 275.00 ( 0.00%) 313.00 ( -13.82%)
Max free-odr4-16 273.00 ( 0.00%) 272.00 ( 0.37%)
Max free-odr4-32 309.00 ( 0.00%) 324.00 ( -4.85%)
Max free-odr4-64 340.00 ( 0.00%) 335.00 ( 1.47%)
Max free-odr4-128 340.00 ( 0.00%) 338.00 ( 0.59%)
Max free-odr4-256 338.00 ( 0.00%) 346.00 ( -2.37%)
Max free-odr4-512 364.00 ( 0.00%) 359.00 ( 1.37%)
Max free-odr4-1024 329.00 ( 0.00%) 359.00 ( -9.12%)
Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
---
mm/page_alloc.c | 10 ++++++----
mm/vmstat.c | 30 +++++++-----------------------
2 files changed, 13 insertions(+), 27 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9431073e7255..a90f2bf735f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -818,7 +818,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
else
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
- area->mt_nr_free[migratetype]++;
+ WRITE_ONCE(area->mt_nr_free[migratetype],
+ area->mt_nr_free[migratetype] + 1);
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
@@ -841,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- area->mt_nr_free[old_mt]--;
- area->mt_nr_free[new_mt]++;
+ WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
+ WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
account_freepages(zone, -nr_pages, old_mt);
account_freepages(zone, nr_pages, new_mt);
@@ -873,7 +874,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
__ClearPageBuddy(page);
set_page_private(page, 0);
area->nr_free--;
- area->mt_nr_free[migratetype]--;
+ WRITE_ONCE(area->mt_nr_free[migratetype],
+ area->mt_nr_free[migratetype] - 1);
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..9334bbbe1e16 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
zone->name,
migratetype_names[mtype]);
for (order = 0; order < NR_PAGE_ORDERS; ++order) {
- unsigned long freecount = 0;
- struct free_area *area;
- struct list_head *curr;
+ unsigned long freecount;
bool overflow = false;
- area = &(zone->free_area[order]);
-
- list_for_each(curr, &area->free_list[mtype]) {
- /*
- * Cap the free_list iteration because it might
- * be really large and we are under a spinlock
- * so a long time spent here could trigger a
- * hard lockup detector. Anyway this is a
- * debugging tool so knowing there is a handful
- * of pages of this order should be more than
- * sufficient.
- */
- if (++freecount >= 100000) {
- overflow = true;
- break;
- }
+ /* Keep the same output format for user-space tools compatibility */
+ freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+ if (freecount >= 100000) {
+ overflow = true;
+ freecount = 100000;
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
- spin_unlock_irq(&zone->lock);
- cond_resched();
- spin_lock_irq(&zone->lock);
}
seq_putc(m, '\n');
}
@@ -1633,7 +1617,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
seq_printf(m, "%6d ", order);
seq_putc(m, '\n');
- walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print);
+ walk_zones_in_node(m, pgdat, true, true, pagetypeinfo_showfree_print);
}
static void pagetypeinfo_showblockcount_print(struct seq_file *m,
--
2.43.0
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count
2025-11-28 3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count Hongru Zhang
@ 2025-11-28 12:03 ` zhongjinji
2025-11-29 0:00 ` Barry Song
0 siblings, 1 reply; 20+ messages in thread
From: zhongjinji @ 2025-11-28 12:03 UTC (permalink / raw)
To: zhanghongru06
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt, surenb,
vbabka, weixugc, yuanchu, zhanghongru, ziy
Hi, Hongru
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9431073e7255..a90f2bf735f6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -818,7 +818,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
> else
> list_add(&page->buddy_list, &area->free_list[migratetype]);
> area->nr_free++;
> - area->mt_nr_free[migratetype]++;
> + WRITE_ONCE(area->mt_nr_free[migratetype],
> + area->mt_nr_free[migratetype] + 1);
>
> if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
> @@ -841,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> get_pageblock_migratetype(page), old_mt, nr_pages);
>
> list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
> - area->mt_nr_free[old_mt]--;
> - area->mt_nr_free[new_mt]++;
> + WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
> + WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
>
> account_freepages(zone, -nr_pages, old_mt);
> account_freepages(zone, nr_pages, new_mt);
> @@ -873,7 +874,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
> __ClearPageBuddy(page);
> set_page_private(page, 0);
> area->nr_free--;
> - area->mt_nr_free[migratetype]--;
> + WRITE_ONCE(area->mt_nr_free[migratetype],
> + area->mt_nr_free[migratetype] - 1);
It doesn't seem like a good idea to use WRITE_ONCE on the hot path.
>
> if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index bb09c032eecf..9334bbbe1e16 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> zone->name,
> migratetype_names[mtype]);
> for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> - unsigned long freecount = 0;
> - struct free_area *area;
> - struct list_head *curr;
> + unsigned long freecount;
> bool overflow = false;
>
> - area = &(zone->free_area[order]);
> -
> - list_for_each(curr, &area->free_list[mtype]) {
> - /*
> - * Cap the free_list iteration because it might
> - * be really large and we are under a spinlock
> - * so a long time spent here could trigger a
> - * hard lockup detector. Anyway this is a
> - * debugging tool so knowing there is a handful
> - * of pages of this order should be more than
> - * sufficient.
> - */
> - if (++freecount >= 100000) {
> - overflow = true;
> - break;
> - }
> + /* Keep the same output format for user-space tools compatibility */
> + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
I think it might be better to use an array of size NR_PAGE_ORDERS to store
the free count for each order, like the code below.
unsigned long freecount[NR_PAGE_ORDERS]
spin_lock_irq(&zone->lock)
for_each_order
freecount[order] = zone->free_area[order].mt_nr_free[mtype]
spin_unlock_irq(&zone->lock)
for_each_order
print freecount[order]
> + if (freecount >= 100000) {
> + overflow = true;
> + freecount = 100000;
> }
> seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
> - spin_unlock_irq(&zone->lock);
> - cond_resched();
> - spin_lock_irq(&zone->lock);
> }
> seq_putc(m, '\n');
> }
> @@ -1633,7 +1617,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
> seq_printf(m, "%6d ", order);
> seq_putc(m, '\n');
>
> - walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print);
> + walk_zones_in_node(m, pgdat, true, true, pagetypeinfo_showfree_print);
> }
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migratetype count
2025-11-28 12:03 ` zhongjinji
@ 2025-11-29 0:00 ` Barry Song
2025-11-29 7:55 ` Barry Song
2025-12-01 12:29 ` Hongru Zhang
0 siblings, 2 replies; 20+ messages in thread
From: Barry Song @ 2025-11-29 0:00 UTC (permalink / raw)
To: zhongjinji
Cc: zhanghongru06, Liam.Howlett, akpm, axelrasmussen, david, hannes,
jackmanb, linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt,
surenb, vbabka, weixugc, yuanchu, zhanghongru, ziy
> > if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> > __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index bb09c032eecf..9334bbbe1e16 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > zone->name,
> > migratetype_names[mtype]);
> > for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> > - unsigned long freecount = 0;
> > - struct free_area *area;
> > - struct list_head *curr;
> > + unsigned long freecount;
> > bool overflow = false;
> >
> > - area = &(zone->free_area[order]);
> > -
> > - list_for_each(curr, &area->free_list[mtype]) {
> > - /*
> > - * Cap the free_list iteration because it might
> > - * be really large and we are under a spinlock
> > - * so a long time spent here could trigger a
> > - * hard lockup detector. Anyway this is a
> > - * debugging tool so knowing there is a handful
> > - * of pages of this order should be more than
> > - * sufficient.
> > - */
> > - if (++freecount >= 100000) {
> > - overflow = true;
> > - break;
> > - }
> > + /* Keep the same output format for user-space tools compatibility */
> > + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
>
> I think it might be better to use an array of size NR_PAGE_ORDERS to store
> the free count for each order, like the code below.
Right. If we want the freecount to accurately reflect the current system
state, we still need to take the zone lock.
Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
correctness. They may ensure single-copy atomicity per access, but not for the
overall result.
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
2025-11-29 0:00 ` Barry Song
@ 2025-11-29 7:55 ` Barry Song
2025-12-01 12:29 ` Hongru Zhang
1 sibling, 0 replies; 20+ messages in thread
From: Barry Song @ 2025-11-29 7:55 UTC (permalink / raw)
To: zhongjinji
Cc: zhanghongru06, Liam.Howlett, akpm, axelrasmussen, david, hannes,
jackmanb, linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt,
surenb, vbabka, weixugc, yuanchu, zhanghongru, ziy
On Sat, Nov 29, 2025 at 8:00 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > > if (order >= pageblock_order && !is_migrate_isolate(migratetype))
> > > __mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index bb09c032eecf..9334bbbe1e16 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > zone->name,
> > > migratetype_names[mtype]);
> > > for (order = 0; order < NR_PAGE_ORDERS; ++order) {
> > > - unsigned long freecount = 0;
> > > - struct free_area *area;
> > > - struct list_head *curr;
> > > + unsigned long freecount;
> > > bool overflow = false;
> > >
> > > - area = &(zone->free_area[order]);
> > > -
> > > - list_for_each(curr, &area->free_list[mtype]) {
> > > - /*
> > > - * Cap the free_list iteration because it might
> > > - * be really large and we are under a spinlock
> > > - * so a long time spent here could trigger a
> > > - * hard lockup detector. Anyway this is a
> > > - * debugging tool so knowing there is a handful
> > > - * of pages of this order should be more than
> > > - * sufficient.
> > > - */
> > > - if (++freecount >= 100000) {
> > > - overflow = true;
> > > - break;
> > > - }
> > > + /* Keep the same output format for user-space tools compatibility */
> > > + freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> >
> > I think it might be better to use an array of size NR_PAGE_ORDERS to store
> > the free count for each order, like the code below.
>
> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.
>
> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.
On second thought, the original code releases and re-acquires the spinlock
for each order, so cross-variable consistency may not be a real issue.
Adding data_race() to silence KCSAN warnings should be sufficient?
I mean something like the following.
@@ -843,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
get_pageblock_migratetype(page), old_mt, nr_pages);
list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
- WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
- WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);
+ area->mt_nr_free[old_mt]--;
+ area->mt_nr_free[new_mt]++;
account_freepages(zone, -nr_pages, old_mt);
account_freepages(zone, nr_pages, new_mt);
@@ -875,8 +874,7 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
__ClearPageBuddy(page);
set_page_private(page, 0);
area->nr_free--;
- WRITE_ONCE(area->mt_nr_free[migratetype],
- area->mt_nr_free[migratetype] - 1);
+ area->mt_nr_free[migratetype]--;
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7e1e931eb209..d74004eb8c4d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1599,7 +1599,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
bool overflow = false;
/* Keep the same output format for user-space tools compatibility */
- freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+ freecount = data_race(zone->free_area[order].mt_nr_free[mtype]);
if (freecount >= 100000) {
overflow = true;
freecount = 100000;
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
2025-11-29 0:00 ` Barry Song
2025-11-29 7:55 ` Barry Song
@ 2025-12-01 12:29 ` Hongru Zhang
2025-12-01 18:54 ` Barry Song
1 sibling, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-12-01 12:29 UTC (permalink / raw)
To: 21cnbao, zhongjinji
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt, surenb,
vbabka, weixugc, yuanchu, zhanghongru06, zhanghongru, ziy
> Right. If we want the freecount to accurately reflect the current system
> state, we still need to take the zone lock.
Yeah, as I mentioned in patch (2/3), this implementation has accuracy
limitations:
"Accuracy. Both implementations have accuracy limitations. The previous
implementation required acquiring and releasing the zone lock for counting
each order and migratetype, making it potentially inaccurate. Under high
memory pressure, accuracy would further degrade due to zone lock
contention or fragmentation. The new implementation collects data within a
short time window, which helps maintain relatively small errors, and is
unaffected by memory pressure. Furthermore, user-space memory management
components inherently experience decision latency - by the time they
process the collected data and execute actions, the memory state has
already changed. This means that even perfectly accurate data at
collection time becomes stale by decision time. Considering these factors,
the accuracy trade-off introduced by the new implementation should be
acceptable for practical use cases, offering a balance between performance
and accuracy requirements."
Additional data:
1. average latency of pagetypeinfo_showfree_print() over 1,000,000
times is 4.67 us
2. average latency is 125 ns, if seq_printf() is taken out of the loop
Example code:
+unsigned long total_lat = 0;
+unsigned long total_count = 0;
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
int order, mtype;
+ ktime_t start;
+ u64 lat;
+ unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
+
+ start = ktime_get();
+ for (order = 0; order < NR_PAGE_ORDERS; ++order)
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+
+ lat = ktime_to_ns(ktime_sub(ktime_get(), start));
+ total_count++;
+ total_lat += lat;
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
bool overflow = false;
/* Keep the same output format for user-space tools compatibility */
- freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+ freecount = freecounts[order][mtype];
if (freecount >= 100000) {
overflow = true;
freecount = 100000;
@@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
#endif /* CONFIG_PAGE_OWNER */
}
I think both collect within a small time window (if IRQs are disabled, the
latency is more deterministic).
> Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> correctness. They may ensure single-copy atomicity per access, but not for the
> overall result.
I know this does not guarantee correctness of the overall result.
READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
store tearing and read tearing caused by compiler optimizations.
In fact, I have already noticed /proc/buddyinfo, which collects data under
zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
we could remove its zone lock as well, for the same reasons as
/proc/pagetypeinfo.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count
2025-12-01 12:29 ` Hongru Zhang
@ 2025-12-01 18:54 ` Barry Song
0 siblings, 0 replies; 20+ messages in thread
From: Barry Song @ 2025-12-01 18:54 UTC (permalink / raw)
To: Hongru Zhang
Cc: zhongjinji, Liam.Howlett, akpm, axelrasmussen, david, hannes,
jackmanb, linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt,
surenb, vbabka, weixugc, yuanchu, zhanghongru, ziy
On Mon, Dec 1, 2025 at 8:29 PM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
> > Right. If we want the freecount to accurately reflect the current system
> > state, we still need to take the zone lock.
>
> Yeah, as I mentioned in patch (2/3), this implementation has an accuracy
> limitation:
>
> "Accuracy. Both implementations have accuracy limitations. The previous
> implementation required acquiring and releasing the zone lock for counting
> each order and migratetype, making it potentially inaccurate. Under high
> memory pressure, accuracy would further degrade due to zone lock
> contention or fragmentation. The new implementation collects data within a
> short time window, which helps maintain relatively small errors, and is
> unaffected by memory pressure. Furthermore, user-space memory management
> components inherently experience decision latency - by the time they
> process the collected data and execute actions, the memory state has
> already changed. This means that even perfectly accurate data at
> collection time becomes stale by decision time. Considering these factors,
> the accuracy trade-off introduced by the new implementation should be
> acceptable for practical use cases, offering a balance between performance
> and accuracy requirements."
>
> Additional data:
> 1. average latency of pagetypeinfo_showfree_print() over 1,000,000
> times is 4.67 us
>
> 2. average latency is 125 ns, if seq_printf() is taken out of the loop
>
> Example code:
>
> +unsigned long total_lat = 0;
> +unsigned long total_count = 0;
> +
> static void pagetypeinfo_showfree_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> {
> int order, mtype;
> + ktime_t start;
> + u64 lat;
> + unsigned long freecounts[NR_PAGE_ORDERS][MIGRATE_TYPES]; /* ignore potential stack overflow */
> +
> + start = ktime_get();
> + for (order = 0; order < NR_PAGE_ORDERS; ++order)
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> + freecounts[order][mtype] = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> +
> + lat = ktime_to_ns(ktime_sub(ktime_get(), start));
> + total_count++;
> + total_lat += lat;
>
> for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1594,7 +1609,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> bool overflow = false;
>
> /* Keep the same output format for user-space tools compatibility */
> - freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
> + freecount = freecounts[order][mtype];
> if (freecount >= 100000) {
> overflow = true;
> freecount = 100000;
> @@ -1692,6 +1707,13 @@ static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
> #endif /* CONFIG_PAGE_OWNER */
> }
>
> I think both collect within a small time window (if IRQs are disabled, the
> latency is more deterministic).
>
> > Multiple independent WRITE_ONCE and READ_ONCE operations do not guarantee
> > correctness. They may ensure single-copy atomicity per access, but not for the
> > overall result.
>
> I know this does not guarantee correctness of the overall result.
> READ_ONCE() and WRITE_ONCE() in this patch are used to avoid potential
> store tearing and read tearing caused by compiler optimizations.
Yes, I realized that correctness might not be a major concern, so I sent a
follow-up email [1] after replying to you.
>
> In fact, I have already noticed /proc/buddyinfo, which collects data under
> zone lock and uses data_race to avoid KCSAN reports. But I'm wondering if
> we could remove its zone lock as well, for the same reasons as
> /proc/pagetypeinfo.
That might be correct. However, if it doesn’t significantly affect performance
and buddyinfo is accessed much less frequently than the buddy list, we may
just leave it as is.
[1] https://lore.kernel.org/linux-mm/CAGsJ_4wUQdQyB_3y0Buf3uG34hvgpMAP3qHHwJM3=R01RJOuvw@mail.gmail.com/
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28 3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
2025-11-28 3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count Hongru Zhang
@ 2025-11-28 3:12 ` Hongru Zhang
2025-11-29 0:04 ` Barry Song
2025-11-28 7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
2025-11-28 9:24 ` Vlastimil Babka
4 siblings, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-11-28 3:12 UTC (permalink / raw)
To: akpm, vbabka, david
Cc: linux-mm, linux-kernel, surenb, mhocko, jackmanb, hannes, ziy,
lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
weixugc, Hongru Zhang
From: Hongru Zhang <zhanghongru@xiaomi.com>
Using per-migratetype counts instead of list_empty() helps save a
few CPU instructions.
Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
---
mm/internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..7759f8fdf445 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
static inline bool free_area_empty(struct free_area *area, int migratetype)
{
- return list_empty(&area->free_list[migratetype]);
+ return !READ_ONCE(area->mt_nr_free[migratetype]);
}
/* mm/util.c */
--
2.43.0
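As a userspace sketch of why the one-line change above is behavior-preserving: as long as the counter is kept in sync with the list under the zone lock, the counter-based and list-based emptiness checks agree. The mock types and the helpers area_init()/add_block() below are illustrative stand-ins, not kernel API.

```c
#include <stdbool.h>

#define MIGRATE_TYPES 6

struct list_head { struct list_head *next, *prev; };

/* minimal mock of free_area with the per-migratetype counter added
 * by patch 1; field names follow the patch, the rest is a userspace
 * stand-in */
struct free_area {
	struct list_head free_list[MIGRATE_TYPES];
	unsigned long    mt_nr_free[MIGRATE_TYPES];
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

void area_init(struct free_area *area)
{
	for (int mt = 0; mt < MIGRATE_TYPES; mt++) {
		list_init(&area->free_list[mt]);
		area->mt_nr_free[mt] = 0;
	}
}

/* old check: pointer chase into the list head */
bool free_area_empty_list(struct free_area *area, int mt)
{
	return list_empty(&area->free_list[mt]);
}

/* new check: single counter load (READ_ONCE dropped in userspace) */
bool free_area_empty_count(struct free_area *area, int mt)
{
	return area->mt_nr_free[mt] == 0;
}

/* add one free page block, keeping list and counter in sync, as the
 * kernel would do under the zone lock */
void add_block(struct free_area *area, struct list_head *page, int mt)
{
	list_add(page, &area->free_list[mt]);
	area->mt_nr_free[mt]++;
}
```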
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts
2025-11-28 3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
@ 2025-11-29 0:04 ` Barry Song
2025-11-29 9:24 ` Barry Song
0 siblings, 1 reply; 20+ messages in thread
From: Barry Song @ 2025-11-29 0:04 UTC (permalink / raw)
To: Hongru Zhang
Cc: akpm, vbabka, david, linux-mm, linux-kernel, surenb, mhocko,
jackmanb, hannes, ziy, lorenzo.stoakes, Liam.Howlett, rppt,
axelrasmussen, yuanchu, weixugc, Hongru Zhang
On Fri, Nov 28, 2025 at 11:13 AM Hongru Zhang <zhanghongru06@gmail.com> wrote:
>
> From: Hongru Zhang <zhanghongru@xiaomi.com>
>
> Using per-migratetype counts instead of list_empty() helps save a
> few CPU instructions.
>
> Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
> ---
> mm/internal.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..7759f8fdf445 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>
> static inline bool free_area_empty(struct free_area *area, int migratetype)
> {
> - return list_empty(&area->free_list[migratetype]);
> + return !READ_ONCE(area->mt_nr_free[migratetype]);
I'm not quite sure about this. Since the counter is written and read more
frequently, cache coherence traffic may actually be higher than for the list
head.
I'd prefer to drop this unless there is real data showing it performs better.
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts
2025-11-29 0:04 ` Barry Song
@ 2025-11-29 9:24 ` Barry Song
0 siblings, 0 replies; 20+ messages in thread
From: Barry Song @ 2025-11-29 9:24 UTC (permalink / raw)
To: Hongru Zhang
Cc: akpm, vbabka, david, linux-mm, linux-kernel, surenb, mhocko,
jackmanb, hannes, ziy, lorenzo.stoakes, Liam.Howlett, rppt,
axelrasmussen, yuanchu, weixugc, Hongru Zhang
On Sat, Nov 29, 2025 at 8:04 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 28, 2025 at 11:13 AM Hongru Zhang <zhanghongru06@gmail.com> wrote:
> >
> > From: Hongru Zhang <zhanghongru@xiaomi.com>
> >
> > Using per-migratetype counts instead of list_empty() helps save a
> > few CPU instructions.
> >
> > Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
> > ---
> > mm/internal.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 1561fc2ff5b8..7759f8fdf445 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> >
> > static inline bool free_area_empty(struct free_area *area, int migratetype)
> > {
> > - return list_empty(&area->free_list[migratetype]);
> > + return !READ_ONCE(area->mt_nr_free[migratetype]);
>
> I'm not quite sure about this. Since the counter is written and read more
> frequently, cache coherence traffic may actually be higher than for the list
> head.
>
> I'd prefer to drop this unless there is real data showing it performs better.
If the goal is to optimize free_area list checks and list_add,
a reasonable approach is to organize the data structure
to reduce false sharing between different mt and order entries.
struct mt_free_area {
struct list_head free_list;
unsigned long nr_free;
} ____cacheline_aligned;
struct free_area {
struct mt_free_area mt_free_area[MIGRATE_TYPES];
};
However, without supporting data, it’s unclear if the space increase
is justified :-)
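The proposed layout trades space for reduced false sharing. A userspace sketch of that space cost, under assumed values MIGRATE_TYPES=6, NR_PAGE_ORDERS=11 and a 64-byte cacheline (typical but configuration-dependent; `aligned(64)` stands in for the kernel's ____cacheline_aligned):

```c
#include <stddef.h>

#define MIGRATE_TYPES  6
#define NR_PAGE_ORDERS 11
#define CACHELINE      64

struct list_head { struct list_head *next, *prev; };

/* layout after patch 1: list head array plus counter array, packed */
struct free_area_now {
	struct list_head free_list[MIGRATE_TYPES];
	unsigned long    mt_nr_free[MIGRATE_TYPES];
	unsigned long    nr_free;
};

/* proposed layout: each (list, counter) pair padded to its own line */
struct mt_free_area {
	struct list_head free_list;
	unsigned long    nr_free;
} __attribute__((aligned(CACHELINE)));

struct free_area_proposed {
	struct mt_free_area mt_free_area[MIGRATE_TYPES];
};

/* per-zone footprint of the free_area array in each layout */
size_t per_zone_now(void)
{
	return sizeof(struct free_area_now) * NR_PAGE_ORDERS;
}

size_t per_zone_proposed(void)
{
	return sizeof(struct free_area_proposed) * NR_PAGE_ORDERS;
}
```

On a 64-bit build this roughly doubles the per-zone free_area footprint, which is why the space increase needs justifying data.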
Thanks
Barry
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
` (2 preceding siblings ...)
2025-11-28 3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
@ 2025-11-28 7:49 ` Lorenzo Stoakes
2025-11-28 8:34 ` Hongru Zhang
2025-11-28 9:24 ` Vlastimil Babka
4 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 7:49 UTC (permalink / raw)
To: Hongru Zhang
Cc: akpm, vbabka, david, linux-mm, linux-kernel, surenb, mhocko,
jackmanb, hannes, ziy, Liam.Howlett, rppt, axelrasmussen,
yuanchu, weixugc, Hongru Zhang
Just a general plea :) could we please try not to send larger series like
this so late.
We're at the last day before the merge window, this is better sent during
6.19-rc1 or if now as an RFC.
Thanks, Lorenzo
On Fri, Nov 28, 2025 at 11:10:11AM +0800, Hongru Zhang wrote:
> On mobile devices, some user-space memory management components check
> memory pressure and fragmentation status periodically or via PSI, and
> take actions such as killing processes or performing memory compaction
> based on this information.
>
> Under high load scenarios, reading /proc/pagetypeinfo causes memory
> management components or memory allocation/free paths to be blocked
> for extended periods waiting for the zone lock, leading to the following
> issues:
> 1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on Qcom
> 8750 platforms, reducing system real-time performance
> 2. Memory management components being blocked for extended periods,
> preventing rapid acquisition of memory fragmentation information for
> critical memory management decisions and actions
> 3. Increased latency in memory allocation and free paths due to prolonged
> zone lock contention
>
> Changes:
> 1. Add per-migratetype counts to the buddy allocator to track free page
> block counts for each migratetype and order
> 2. Optimize /proc/pagetypeinfo access by utilizing these per-migratetype
> counts instead of iterating through free lists under zone lock
>
> Performance testing shows following improvements:
> 1. /proc/pagetypeinfo access latency reduced
>
> +-----------------------+----------+------------+
> | | no-patch | with-patch |
> +-----------------------+----------+------------+
> | Just after boot | 700.9 us | 268.6 us |
> +-----------------------+----------+------------+
> | After building kernel | 28.7 ms | 269.8 us |
> +-----------------------+----------+------------+
>
> 2. When /proc/pagetypeinfo is accessed concurrently, memory allocation and
> free performance degradation is reduced compared to the previous
> implementation
>
> Test setup:
> - Using config-pagealloc-micro
> - Monitor set to proc-pagetypeinfo, update frequency set to 10ms
> - PAGEALLOC_ORDER_MIN=4, PAGEALLOC_ORDER_MAX=4
>
> Without patch test results:
> vanilla vanilla
> no-monitor monitor
> Min alloc-odr4-1 8539.00 ( 0.00%) 8762.00 ( -2.61%)
> Min alloc-odr4-2 6501.00 ( 0.00%) 6683.00 ( -2.80%)
> Min alloc-odr4-4 5537.00 ( 0.00%) 5873.00 ( -6.07%)
> Min alloc-odr4-8 5030.00 ( 0.00%) 5361.00 ( -6.58%)
> Min alloc-odr4-16 4782.00 ( 0.00%) 5162.00 ( -7.95%)
> Min alloc-odr4-32 5838.00 ( 0.00%) 6499.00 ( -11.32%)
> Min alloc-odr4-64 6565.00 ( 0.00%) 7413.00 ( -12.92%)
> Min alloc-odr4-128 6896.00 ( 0.00%) 7898.00 ( -14.53%)
> Min alloc-odr4-256 7303.00 ( 0.00%) 8163.00 ( -11.78%)
> Min alloc-odr4-512 10179.00 ( 0.00%) 11985.00 ( -17.74%)
> Min alloc-odr4-1024 11000.00 ( 0.00%) 12165.00 ( -10.59%)
> Min free-odr4-1 820.00 ( 0.00%) 1230.00 ( -50.00%)
> Min free-odr4-2 511.00 ( 0.00%) 952.00 ( -86.30%)
> Min free-odr4-4 347.00 ( 0.00%) 434.00 ( -25.07%)
> Min free-odr4-8 286.00 ( 0.00%) 399.00 ( -39.51%)
> Min free-odr4-16 250.00 ( 0.00%) 405.00 ( -62.00%)
> Min free-odr4-32 294.00 ( 0.00%) 405.00 ( -37.76%)
> Min free-odr4-64 333.00 ( 0.00%) 363.00 ( -9.01%)
> Min free-odr4-128 340.00 ( 0.00%) 412.00 ( -21.18%)
> Min free-odr4-256 339.00 ( 0.00%) 329.00 ( 2.95%)
> Min free-odr4-512 361.00 ( 0.00%) 409.00 ( -13.30%)
> Min free-odr4-1024 300.00 ( 0.00%) 361.00 ( -20.33%)
> Stddev alloc-odr4-1 7.29 ( 0.00%) 90.78 (-1146.00%)
> Stddev alloc-odr4-2 3.87 ( 0.00%) 51.30 (-1225.75%)
> Stddev alloc-odr4-4 3.20 ( 0.00%) 50.90 (-1491.24%)
> Stddev alloc-odr4-8 4.67 ( 0.00%) 52.23 (-1019.35%)
> Stddev alloc-odr4-16 5.72 ( 0.00%) 27.53 (-381.04%)
> Stddev alloc-odr4-32 6.25 ( 0.00%) 641.23 (-10154.46%)
> Stddev alloc-odr4-64 2.06 ( 0.00%) 386.99 (-18714.22%)
> Stddev alloc-odr4-128 14.36 ( 0.00%) 52.39 (-264.77%)
> Stddev alloc-odr4-256 32.42 ( 0.00%) 326.19 (-906.05%)
> Stddev alloc-odr4-512 65.58 ( 0.00%) 184.49 (-181.31%)
> Stddev alloc-odr4-1024 8.88 ( 0.00%) 153.01 (-1622.67%)
> Stddev free-odr4-1 2.29 ( 0.00%) 152.27 (-6549.85%)
> Stddev free-odr4-2 10.99 ( 0.00%) 73.10 (-564.89%)
> Stddev free-odr4-4 1.99 ( 0.00%) 28.40 (-1324.45%)
> Stddev free-odr4-8 2.51 ( 0.00%) 52.93 (-2007.64%)
> Stddev free-odr4-16 2.85 ( 0.00%) 26.04 (-814.88%)
> Stddev free-odr4-32 4.04 ( 0.00%) 27.05 (-569.79%)
> Stddev free-odr4-64 2.10 ( 0.00%) 48.07 (-2185.66%)
> Stddev free-odr4-128 2.63 ( 0.00%) 26.23 (-897.86%)
> Stddev free-odr4-256 6.29 ( 0.00%) 37.04 (-488.71%)
> Stddev free-odr4-512 2.56 ( 0.00%) 10.65 (-315.28%)
> Stddev free-odr4-1024 0.95 ( 0.00%) 6.46 (-582.22%)
> Max alloc-odr4-1 8564.00 ( 0.00%) 9099.00 ( -6.25%)
> Max alloc-odr4-2 6511.00 ( 0.00%) 6844.00 ( -5.11%)
> Max alloc-odr4-4 5549.00 ( 0.00%) 6038.00 ( -8.81%)
> Max alloc-odr4-8 5045.00 ( 0.00%) 5551.00 ( -10.03%)
> Max alloc-odr4-16 4800.00 ( 0.00%) 5257.00 ( -9.52%)
> Max alloc-odr4-32 5861.00 ( 0.00%) 8115.00 ( -38.46%)
> Max alloc-odr4-64 6571.00 ( 0.00%) 8292.00 ( -26.19%)
> Max alloc-odr4-128 6930.00 ( 0.00%) 8081.00 ( -16.61%)
> Max alloc-odr4-256 7372.00 ( 0.00%) 9150.00 ( -24.12%)
> Max alloc-odr4-512 10333.00 ( 0.00%) 12636.00 ( -22.29%)
> Max alloc-odr4-1024 11035.00 ( 0.00%) 12590.00 ( -14.09%)
> Max free-odr4-1 828.00 ( 0.00%) 1724.00 (-108.21%)
> Max free-odr4-2 543.00 ( 0.00%) 1192.00 (-119.52%)
> Max free-odr4-4 354.00 ( 0.00%) 519.00 ( -46.61%)
> Max free-odr4-8 293.00 ( 0.00%) 617.00 (-110.58%)
> Max free-odr4-16 260.00 ( 0.00%) 483.00 ( -85.77%)
> Max free-odr4-32 308.00 ( 0.00%) 488.00 ( -58.44%)
> Max free-odr4-64 341.00 ( 0.00%) 505.00 ( -48.09%)
> Max free-odr4-128 346.00 ( 0.00%) 497.00 ( -43.64%)
> Max free-odr4-256 353.00 ( 0.00%) 463.00 ( -31.16%)
> Max free-odr4-512 367.00 ( 0.00%) 442.00 ( -20.44%)
> Max free-odr4-1024 303.00 ( 0.00%) 381.00 ( -25.74%)
>
> With patch test results:
> patched patched
> no-monitor monitor
> Min alloc-odr4-1 8488.00 ( 0.00%) 8514.00 ( -0.31%)
> Min alloc-odr4-2 6551.00 ( 0.00%) 6527.00 ( 0.37%)
> Min alloc-odr4-4 5536.00 ( 0.00%) 5591.00 ( -0.99%)
> Min alloc-odr4-8 5008.00 ( 0.00%) 5098.00 ( -1.80%)
> Min alloc-odr4-16 4760.00 ( 0.00%) 4857.00 ( -2.04%)
> Min alloc-odr4-32 5827.00 ( 0.00%) 5919.00 ( -1.58%)
> Min alloc-odr4-64 6561.00 ( 0.00%) 6680.00 ( -1.81%)
> Min alloc-odr4-128 6898.00 ( 0.00%) 7014.00 ( -1.68%)
> Min alloc-odr4-256 7311.00 ( 0.00%) 7464.00 ( -2.09%)
> Min alloc-odr4-512 10181.00 ( 0.00%) 10286.00 ( -1.03%)
> Min alloc-odr4-1024 11205.00 ( 0.00%) 11725.00 ( -4.64%)
> Min free-odr4-1 789.00 ( 0.00%) 867.00 ( -9.89%)
> Min free-odr4-2 490.00 ( 0.00%) 526.00 ( -7.35%)
> Min free-odr4-4 350.00 ( 0.00%) 360.00 ( -2.86%)
> Min free-odr4-8 272.00 ( 0.00%) 287.00 ( -5.51%)
> Min free-odr4-16 247.00 ( 0.00%) 254.00 ( -2.83%)
> Min free-odr4-32 298.00 ( 0.00%) 304.00 ( -2.01%)
> Min free-odr4-64 334.00 ( 0.00%) 325.00 ( 2.69%)
> Min free-odr4-128 334.00 ( 0.00%) 329.00 ( 1.50%)
> Min free-odr4-256 336.00 ( 0.00%) 336.00 ( 0.00%)
> Min free-odr4-512 360.00 ( 0.00%) 342.00 ( 5.00%)
> Min free-odr4-1024 327.00 ( 0.00%) 355.00 ( -8.56%)
> Stddev alloc-odr4-1 5.19 ( 0.00%) 45.38 (-775.09%)
> Stddev alloc-odr4-2 6.99 ( 0.00%) 37.63 (-437.98%)
> Stddev alloc-odr4-4 3.91 ( 0.00%) 17.85 (-356.28%)
> Stddev alloc-odr4-8 5.15 ( 0.00%) 9.34 ( -81.47%)
> Stddev alloc-odr4-16 3.83 ( 0.00%) 5.34 ( -39.34%)
> Stddev alloc-odr4-32 1.96 ( 0.00%) 10.28 (-425.09%)
> Stddev alloc-odr4-64 1.32 ( 0.00%) 333.30 (-25141.39%)
> Stddev alloc-odr4-128 2.06 ( 0.00%) 7.37 (-258.28%)
> Stddev alloc-odr4-256 15.56 ( 0.00%) 113.48 (-629.25%)
> Stddev alloc-odr4-512 61.25 ( 0.00%) 165.09 (-169.53%)
> Stddev alloc-odr4-1024 18.89 ( 0.00%) 2.93 ( 84.51%)
> Stddev free-odr4-1 4.45 ( 0.00%) 40.12 (-800.98%)
> Stddev free-odr4-2 1.50 ( 0.00%) 29.30 (-1850.31%)
> Stddev free-odr4-4 1.27 ( 0.00%) 19.49 (-1439.40%)
> Stddev free-odr4-8 0.97 ( 0.00%) 8.93 (-823.07%)
> Stddev free-odr4-16 8.38 ( 0.00%) 4.51 ( 46.21%)
> Stddev free-odr4-32 3.18 ( 0.00%) 6.59 (-107.42%)
> Stddev free-odr4-64 2.40 ( 0.00%) 3.09 ( -28.50%)
> Stddev free-odr4-128 1.55 ( 0.00%) 2.53 ( -62.92%)
> Stddev free-odr4-256 0.41 ( 0.00%) 2.80 (-585.57%)
> Stddev free-odr4-512 1.60 ( 0.00%) 4.84 (-202.08%)
> Stddev free-odr4-1024 0.66 ( 0.00%) 1.19 ( -80.68%)
> Max alloc-odr4-1 8505.00 ( 0.00%) 8676.00 ( -2.01%)
> Max alloc-odr4-2 6572.00 ( 0.00%) 6651.00 ( -1.20%)
> Max alloc-odr4-4 5552.00 ( 0.00%) 5646.00 ( -1.69%)
> Max alloc-odr4-8 5024.00 ( 0.00%) 5131.00 ( -2.13%)
> Max alloc-odr4-16 4774.00 ( 0.00%) 4875.00 ( -2.12%)
> Max alloc-odr4-32 5834.00 ( 0.00%) 5950.00 ( -1.99%)
> Max alloc-odr4-64 6565.00 ( 0.00%) 7434.00 ( -13.24%)
> Max alloc-odr4-128 6907.00 ( 0.00%) 7034.00 ( -1.84%)
> Max alloc-odr4-256 7347.00 ( 0.00%) 7843.00 ( -6.75%)
> Max alloc-odr4-512 10315.00 ( 0.00%) 10866.00 ( -5.34%)
> Max alloc-odr4-1024 11278.00 ( 0.00%) 11733.00 ( -4.03%)
> Max free-odr4-1 803.00 ( 0.00%) 1009.00 ( -25.65%)
> Max free-odr4-2 495.00 ( 0.00%) 607.00 ( -22.63%)
> Max free-odr4-4 354.00 ( 0.00%) 417.00 ( -17.80%)
> Max free-odr4-8 275.00 ( 0.00%) 313.00 ( -13.82%)
> Max free-odr4-16 273.00 ( 0.00%) 272.00 ( 0.37%)
> Max free-odr4-32 309.00 ( 0.00%) 324.00 ( -4.85%)
> Max free-odr4-64 340.00 ( 0.00%) 335.00 ( 1.47%)
> Max free-odr4-128 340.00 ( 0.00%) 338.00 ( 0.59%)
> Max free-odr4-256 338.00 ( 0.00%) 346.00 ( -2.37%)
> Max free-odr4-512 364.00 ( 0.00%) 359.00 ( 1.37%)
> Max free-odr4-1024 329.00 ( 0.00%) 359.00 ( -9.12%)
>
> The main overhead is a slight increase in latency on the memory allocation
> and free paths due to additional per-migratetype counting, with
> theoretically minimal impact on overall performance.
>
> This patch series is based on v6.18-rc7
>
> Hongru Zhang (3):
> mm/page_alloc: add per-migratetype counts to buddy allocator
> mm/vmstat: get fragmentation statistics from per-migragetype count
> mm: optimize free_area_empty() check using per-migratetype counts
>
> include/linux/mmzone.h | 1 +
> mm/internal.h | 2 +-
> mm/mm_init.c | 1 +
> mm/page_alloc.c | 9 ++++++++-
> mm/vmstat.c | 30 +++++++-----------------------
> 5 files changed, 18 insertions(+), 25 deletions(-)
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
@ 2025-11-28 8:34 ` Hongru Zhang
2025-11-28 8:40 ` Lorenzo Stoakes
0 siblings, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-11-28 8:34 UTC (permalink / raw)
To: lorenzo.stoakes
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, mhocko, rppt, surenb, vbabka, weixugc,
yuanchu, zhanghongru06, zhanghongru, ziy
> Just a general plea :) could we please try not to send larger series like
> this so late.
>
> We're at the last day before the merge window, this is better sent during
> 6.19-rc1 or if now as an RFC.
>
> Thanks, Lorenzo
Hi Lorenzo,
Thank you for your feedback and sorry for the late submission. You're
right - this series should have been sent earlier. Apologize for not
following the proper submission timing guidelines. I'll make sure to
follow the community norms and submit similar work well in advance in
the future.
Thanks again for your patience and guidance.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 8:34 ` Hongru Zhang
@ 2025-11-28 8:40 ` Lorenzo Stoakes
0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-11-28 8:40 UTC (permalink / raw)
To: Hongru Zhang
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, mhocko, rppt, surenb, vbabka, weixugc,
yuanchu, zhanghongru, ziy
On Fri, Nov 28, 2025 at 04:34:37PM +0800, Hongru Zhang wrote:
> > Just a general plea :) could we please try not to send larger series like
> > this so late.
> >
> > We're at the last day before the merge window, this is better sent during
> > 6.19-rc1 or if now as an RFC.
> >
> > Thanks, Lorenzo
>
> Hi Lorenzo,
>
> Thank you for your feedback and sorry for the late submission. You're
> right - this series should have been sent earlier. Apologize for not
> following the proper submission timing guidelines. I'll make sure to
> follow the community norms and submit similar work well in advance in
> the future.
>
> Thanks again for your patience and guidance.
Hi Hongru,
Sorry I don't mean to be critical here and you weren't to know :) rather just in
general - a plea for how we do things in mm.
Your series is very much appreciated, you didn't do anything wrong at all - this
is just essentially an admin thing :P
We will absolutely review your series it's just about timing. And of course I'm
just sort of making a point here, reviewers can choose to review as and when
they want! :)
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
` (3 preceding siblings ...)
2025-11-28 7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
@ 2025-11-28 9:24 ` Vlastimil Babka
2025-11-28 13:08 ` Johannes Weiner
2025-12-01 2:36 ` Hongru Zhang
4 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2025-11-28 9:24 UTC (permalink / raw)
To: Hongru Zhang, akpm, david
Cc: linux-mm, linux-kernel, surenb, mhocko, jackmanb, hannes, ziy,
lorenzo.stoakes, Liam.Howlett, rppt, axelrasmussen, yuanchu,
weixugc, Hongru Zhang
On 11/28/25 04:10, Hongru Zhang wrote:
> On mobile devices, some user-space memory management components check
> memory pressure and fragmentation status periodically or via PSI, and
> take actions such as killing processes or performing memory compaction
> based on this information.
Hm /proc/buddyinfo could be enough to determine fragmentation? Also we have
in-kernel proactive compaction these days.
> Under high load scenarios, reading /proc/pagetypeinfo causes memory
> management components or memory allocation/free paths to be blocked
> for extended periods waiting for the zone lock, leading to the following
> issues:
> 1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on Qcom
> 8750 platforms, reducing system real-time performance
> 2. Memory management components being blocked for extended periods,
> preventing rapid acquisition of memory fragmentation information for
> critical memory management decisions and actions
> 3. Increased latency in memory allocation and free paths due to prolonged
> zone lock contention
It could be argued that not capturing /proc/pagetypeinfo (often) would help.
I wonder if we can also find other benefits from the counters in the kernel
itself.
Adding these migratetype counters is something that wouldn't even have been
possible in the past, until the freelist migratetype hygiene series was merged.
So now it should be AFAIK possible, but it's still some overhead in
relatively hot paths. I wonder if we even considered this before in the
context of migratetype hygiene? Couldn't find anything quickly.
> Changes:
> 1. Add per-migratetype counts to the buddy allocator to track free page
> block counts for each migratetype and order
> 2. Optimize /proc/pagetypeinfo access by utilizing these per-migratetype
> counts instead of iterating through free lists under zone lock
>
> Performance testing shows following improvements:
> 1. /proc/pagetypeinfo access latency reduced
>
> +-----------------------+----------+------------+
> | | no-patch | with-patch |
> +-----------------------+----------+------------+
> | Just after boot | 700.9 us | 268.6 us |
> +-----------------------+----------+------------+
> | After building kernel | 28.7 ms | 269.8 us |
> +-----------------------+----------+------------+
>
> 2. When /proc/pagetypeinfo is accessed concurrently, memory allocation and
> free performance degradation is reduced compared to the previous
> implementation
>
> Test setup:
> - Using config-pagealloc-micro
> - Monitor set to proc-pagetypeinfo, update frequency set to 10ms
> - PAGEALLOC_ORDER_MIN=4, PAGEALLOC_ORDER_MAX=4
>
> Without patch test results:
> vanilla vanilla
> no-monitor monitor
> Min alloc-odr4-1 8539.00 ( 0.00%) 8762.00 ( -2.61%)
> Min alloc-odr4-2 6501.00 ( 0.00%) 6683.00 ( -2.80%)
> Min alloc-odr4-4 5537.00 ( 0.00%) 5873.00 ( -6.07%)
> Min alloc-odr4-8 5030.00 ( 0.00%) 5361.00 ( -6.58%)
> Min alloc-odr4-16 4782.00 ( 0.00%) 5162.00 ( -7.95%)
> Min alloc-odr4-32 5838.00 ( 0.00%) 6499.00 ( -11.32%)
> Min alloc-odr4-64 6565.00 ( 0.00%) 7413.00 ( -12.92%)
> Min alloc-odr4-128 6896.00 ( 0.00%) 7898.00 ( -14.53%)
> Min alloc-odr4-256 7303.00 ( 0.00%) 8163.00 ( -11.78%)
> Min alloc-odr4-512 10179.00 ( 0.00%) 11985.00 ( -17.74%)
> Min alloc-odr4-1024 11000.00 ( 0.00%) 12165.00 ( -10.59%)
> Min free-odr4-1 820.00 ( 0.00%) 1230.00 ( -50.00%)
> Min free-odr4-2 511.00 ( 0.00%) 952.00 ( -86.30%)
> Min free-odr4-4 347.00 ( 0.00%) 434.00 ( -25.07%)
> Min free-odr4-8 286.00 ( 0.00%) 399.00 ( -39.51%)
> Min free-odr4-16 250.00 ( 0.00%) 405.00 ( -62.00%)
> Min free-odr4-32 294.00 ( 0.00%) 405.00 ( -37.76%)
> Min free-odr4-64 333.00 ( 0.00%) 363.00 ( -9.01%)
> Min free-odr4-128 340.00 ( 0.00%) 412.00 ( -21.18%)
> Min free-odr4-256 339.00 ( 0.00%) 329.00 ( 2.95%)
> Min free-odr4-512 361.00 ( 0.00%) 409.00 ( -13.30%)
> Min free-odr4-1024 300.00 ( 0.00%) 361.00 ( -20.33%)
> Stddev alloc-odr4-1 7.29 ( 0.00%) 90.78 (-1146.00%)
> Stddev alloc-odr4-2 3.87 ( 0.00%) 51.30 (-1225.75%)
> Stddev alloc-odr4-4 3.20 ( 0.00%) 50.90 (-1491.24%)
> Stddev alloc-odr4-8 4.67 ( 0.00%) 52.23 (-1019.35%)
> Stddev alloc-odr4-16 5.72 ( 0.00%) 27.53 (-381.04%)
> Stddev alloc-odr4-32 6.25 ( 0.00%) 641.23 (-10154.46%)
> Stddev alloc-odr4-64 2.06 ( 0.00%) 386.99 (-18714.22%)
> Stddev alloc-odr4-128 14.36 ( 0.00%) 52.39 (-264.77%)
> Stddev alloc-odr4-256 32.42 ( 0.00%) 326.19 (-906.05%)
> Stddev alloc-odr4-512 65.58 ( 0.00%) 184.49 (-181.31%)
> Stddev alloc-odr4-1024 8.88 ( 0.00%) 153.01 (-1622.67%)
> Stddev free-odr4-1 2.29 ( 0.00%) 152.27 (-6549.85%)
> Stddev free-odr4-2 10.99 ( 0.00%) 73.10 (-564.89%)
> Stddev free-odr4-4 1.99 ( 0.00%) 28.40 (-1324.45%)
> Stddev free-odr4-8 2.51 ( 0.00%) 52.93 (-2007.64%)
> Stddev free-odr4-16 2.85 ( 0.00%) 26.04 (-814.88%)
> Stddev free-odr4-32 4.04 ( 0.00%) 27.05 (-569.79%)
> Stddev free-odr4-64 2.10 ( 0.00%) 48.07 (-2185.66%)
> Stddev free-odr4-128 2.63 ( 0.00%) 26.23 (-897.86%)
> Stddev free-odr4-256 6.29 ( 0.00%) 37.04 (-488.71%)
> Stddev free-odr4-512 2.56 ( 0.00%) 10.65 (-315.28%)
> Stddev free-odr4-1024 0.95 ( 0.00%) 6.46 (-582.22%)
> Max alloc-odr4-1 8564.00 ( 0.00%) 9099.00 ( -6.25%)
> Max alloc-odr4-2 6511.00 ( 0.00%) 6844.00 ( -5.11%)
> Max alloc-odr4-4 5549.00 ( 0.00%) 6038.00 ( -8.81%)
> Max alloc-odr4-8 5045.00 ( 0.00%) 5551.00 ( -10.03%)
> Max alloc-odr4-16 4800.00 ( 0.00%) 5257.00 ( -9.52%)
> Max alloc-odr4-32 5861.00 ( 0.00%) 8115.00 ( -38.46%)
> Max alloc-odr4-64 6571.00 ( 0.00%) 8292.00 ( -26.19%)
> Max alloc-odr4-128 6930.00 ( 0.00%) 8081.00 ( -16.61%)
> Max alloc-odr4-256 7372.00 ( 0.00%) 9150.00 ( -24.12%)
> Max alloc-odr4-512 10333.00 ( 0.00%) 12636.00 ( -22.29%)
> Max alloc-odr4-1024 11035.00 ( 0.00%) 12590.00 ( -14.09%)
> Max free-odr4-1 828.00 ( 0.00%) 1724.00 (-108.21%)
> Max free-odr4-2 543.00 ( 0.00%) 1192.00 (-119.52%)
> Max free-odr4-4 354.00 ( 0.00%) 519.00 ( -46.61%)
> Max free-odr4-8 293.00 ( 0.00%) 617.00 (-110.58%)
> Max free-odr4-16 260.00 ( 0.00%) 483.00 ( -85.77%)
> Max free-odr4-32 308.00 ( 0.00%) 488.00 ( -58.44%)
> Max free-odr4-64 341.00 ( 0.00%) 505.00 ( -48.09%)
> Max free-odr4-128 346.00 ( 0.00%) 497.00 ( -43.64%)
> Max free-odr4-256 353.00 ( 0.00%) 463.00 ( -31.16%)
> Max free-odr4-512 367.00 ( 0.00%) 442.00 ( -20.44%)
> Max free-odr4-1024 303.00 ( 0.00%) 381.00 ( -25.74%)
>
> With patch test results:
> patched patched
> no-monitor monitor
> Min alloc-odr4-1 8488.00 ( 0.00%) 8514.00 ( -0.31%)
> Min alloc-odr4-2 6551.00 ( 0.00%) 6527.00 ( 0.37%)
> Min alloc-odr4-4 5536.00 ( 0.00%) 5591.00 ( -0.99%)
> Min alloc-odr4-8 5008.00 ( 0.00%) 5098.00 ( -1.80%)
> Min alloc-odr4-16 4760.00 ( 0.00%) 4857.00 ( -2.04%)
> Min alloc-odr4-32 5827.00 ( 0.00%) 5919.00 ( -1.58%)
> Min alloc-odr4-64 6561.00 ( 0.00%) 6680.00 ( -1.81%)
> Min alloc-odr4-128 6898.00 ( 0.00%) 7014.00 ( -1.68%)
> Min alloc-odr4-256 7311.00 ( 0.00%) 7464.00 ( -2.09%)
> Min alloc-odr4-512 10181.00 ( 0.00%) 10286.00 ( -1.03%)
> Min alloc-odr4-1024 11205.00 ( 0.00%) 11725.00 ( -4.64%)
> Min free-odr4-1 789.00 ( 0.00%) 867.00 ( -9.89%)
> Min free-odr4-2 490.00 ( 0.00%) 526.00 ( -7.35%)
> Min free-odr4-4 350.00 ( 0.00%) 360.00 ( -2.86%)
> Min free-odr4-8 272.00 ( 0.00%) 287.00 ( -5.51%)
> Min free-odr4-16 247.00 ( 0.00%) 254.00 ( -2.83%)
> Min free-odr4-32 298.00 ( 0.00%) 304.00 ( -2.01%)
> Min free-odr4-64 334.00 ( 0.00%) 325.00 ( 2.69%)
> Min free-odr4-128 334.00 ( 0.00%) 329.00 ( 1.50%)
> Min free-odr4-256 336.00 ( 0.00%) 336.00 ( 0.00%)
> Min free-odr4-512 360.00 ( 0.00%) 342.00 ( 5.00%)
> Min free-odr4-1024 327.00 ( 0.00%) 355.00 ( -8.56%)
> Stddev alloc-odr4-1 5.19 ( 0.00%) 45.38 (-775.09%)
> Stddev alloc-odr4-2 6.99 ( 0.00%) 37.63 (-437.98%)
> Stddev alloc-odr4-4 3.91 ( 0.00%) 17.85 (-356.28%)
> Stddev alloc-odr4-8 5.15 ( 0.00%) 9.34 ( -81.47%)
> Stddev alloc-odr4-16 3.83 ( 0.00%) 5.34 ( -39.34%)
> Stddev alloc-odr4-32 1.96 ( 0.00%) 10.28 (-425.09%)
> Stddev alloc-odr4-64 1.32 ( 0.00%) 333.30 (-25141.39%)
> Stddev alloc-odr4-128 2.06 ( 0.00%) 7.37 (-258.28%)
> Stddev alloc-odr4-256 15.56 ( 0.00%) 113.48 (-629.25%)
> Stddev alloc-odr4-512 61.25 ( 0.00%) 165.09 (-169.53%)
> Stddev alloc-odr4-1024 18.89 ( 0.00%) 2.93 ( 84.51%)
> Stddev free-odr4-1 4.45 ( 0.00%) 40.12 (-800.98%)
> Stddev free-odr4-2 1.50 ( 0.00%) 29.30 (-1850.31%)
> Stddev free-odr4-4 1.27 ( 0.00%) 19.49 (-1439.40%)
> Stddev free-odr4-8 0.97 ( 0.00%) 8.93 (-823.07%)
> Stddev free-odr4-16 8.38 ( 0.00%) 4.51 ( 46.21%)
> Stddev free-odr4-32 3.18 ( 0.00%) 6.59 (-107.42%)
> Stddev free-odr4-64 2.40 ( 0.00%) 3.09 ( -28.50%)
> Stddev free-odr4-128 1.55 ( 0.00%) 2.53 ( -62.92%)
> Stddev free-odr4-256 0.41 ( 0.00%) 2.80 (-585.57%)
> Stddev free-odr4-512 1.60 ( 0.00%) 4.84 (-202.08%)
> Stddev free-odr4-1024 0.66 ( 0.00%) 1.19 ( -80.68%)
> Max alloc-odr4-1 8505.00 ( 0.00%) 8676.00 ( -2.01%)
> Max alloc-odr4-2 6572.00 ( 0.00%) 6651.00 ( -1.20%)
> Max alloc-odr4-4 5552.00 ( 0.00%) 5646.00 ( -1.69%)
> Max alloc-odr4-8 5024.00 ( 0.00%) 5131.00 ( -2.13%)
> Max alloc-odr4-16 4774.00 ( 0.00%) 4875.00 ( -2.12%)
> Max alloc-odr4-32 5834.00 ( 0.00%) 5950.00 ( -1.99%)
> Max alloc-odr4-64 6565.00 ( 0.00%) 7434.00 ( -13.24%)
> Max alloc-odr4-128 6907.00 ( 0.00%) 7034.00 ( -1.84%)
> Max alloc-odr4-256 7347.00 ( 0.00%) 7843.00 ( -6.75%)
> Max alloc-odr4-512 10315.00 ( 0.00%) 10866.00 ( -5.34%)
> Max alloc-odr4-1024 11278.00 ( 0.00%) 11733.00 ( -4.03%)
> Max free-odr4-1 803.00 ( 0.00%) 1009.00 ( -25.65%)
> Max free-odr4-2 495.00 ( 0.00%) 607.00 ( -22.63%)
> Max free-odr4-4 354.00 ( 0.00%) 417.00 ( -17.80%)
> Max free-odr4-8 275.00 ( 0.00%) 313.00 ( -13.82%)
> Max free-odr4-16 273.00 ( 0.00%) 272.00 ( 0.37%)
> Max free-odr4-32 309.00 ( 0.00%) 324.00 ( -4.85%)
> Max free-odr4-64 340.00 ( 0.00%) 335.00 ( 1.47%)
> Max free-odr4-128 340.00 ( 0.00%) 338.00 ( 0.59%)
> Max free-odr4-256 338.00 ( 0.00%) 346.00 ( -2.37%)
> Max free-odr4-512 364.00 ( 0.00%) 359.00 ( 1.37%)
> Max free-odr4-1024 329.00 ( 0.00%) 359.00 ( -9.12%)
>
> The main overhead is a slight increase in latency on the memory allocation
> and free paths due to additional per-migratetype counting, with
> theoretically minimal impact on overall performance.
>
> This patch series is based on v6.18-rc7
>
> Hongru Zhang (3):
> mm/page_alloc: add per-migratetype counts to buddy allocator
> mm/vmstat: get fragmentation statistics from per-migratetype count
> mm: optimize free_area_empty() check using per-migratetype counts
>
> include/linux/mmzone.h | 1 +
> mm/internal.h | 2 +-
> mm/mm_init.c | 1 +
> mm/page_alloc.c | 9 ++++++++-
> mm/vmstat.c | 30 +++++++-----------------------
> 5 files changed, 18 insertions(+), 25 deletions(-)
>
^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 9:24 ` Vlastimil Babka
@ 2025-11-28 13:08 ` Johannes Weiner
2025-12-01 2:36 ` Hongru Zhang
1 sibling, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2025-11-28 13:08 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Hongru Zhang, akpm, david, linux-mm, linux-kernel, surenb,
mhocko, jackmanb, ziy, lorenzo.stoakes, Liam.Howlett, rppt,
axelrasmussen, yuanchu, weixugc, Hongru Zhang
On Fri, Nov 28, 2025 at 10:24:16AM +0100, Vlastimil Babka wrote:
> On 11/28/25 04:10, Hongru Zhang wrote:
> > On mobile devices, some user-space memory management components check
> > memory pressure and fragmentation status periodically or via PSI, and
> > take actions such as killing processes or performing memory compaction
> > based on this information.
>
> Hm /proc/buddyinfo could be enough to determine fragmentation? Also we have
> in-kernel proactive compaction these days.
>
> > Under high load scenarios, reading /proc/pagetypeinfo causes memory
> > management components or memory allocation/free paths to be blocked
> > for extended periods waiting for the zone lock, leading to the following
> > issues:
> > 1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on Qcom
> > 8750 platforms, reducing system real-time performance
> > 2. Memory management components being blocked for extended periods,
> > preventing rapid acquisition of memory fragmentation information for
> > critical memory management decisions and actions
> > 3. Increased latency in memory allocation and free paths due to prolonged
> > zone lock contention
>
> It could be argued that not capturing /proc/pagetypeinfo (often) would help.
> I wonder if we can find also other benefits from the counters in the kernel
> itself.
In earlier iterations of the huge allocator patches, I played around
with using these for compaction_suitable():
https://lore.kernel.org/linux-mm/20230418191313.268131-17-hannes@cmpxchg.org/
ISTR it cut down compaction numbers, because it would avoid runs where
free pages are mostly in unsuitable targets (free_unmovable). But this
was also in a series that used compaction_suitable() to stop kswapd,
which in hindsight was a mistake; it would need re-evaluating by itself.
I also found these counters useful to have in OOM/allocfail dumps to
see if allocator packing or compaction could have done better.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-11-28 9:24 ` Vlastimil Babka
2025-11-28 13:08 ` Johannes Weiner
@ 2025-12-01 2:36 ` Hongru Zhang
2025-12-01 17:01 ` Zi Yan
1 sibling, 1 reply; 20+ messages in thread
From: Hongru Zhang @ 2025-12-01 2:36 UTC (permalink / raw)
To: vbabka
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt, surenb,
weixugc, yuanchu, zhanghongru06, zhanghongru, ziy
> > On mobile devices, some user-space memory management components check
> > memory pressure and fragmentation status periodically or via PSI, and
> > take actions such as killing processes or performing memory compaction
> > based on this information.
>
> Hm /proc/buddyinfo could be enough to determine fragmentation? Also we have
> in-kernel proactive compaction these days.
In fact, besides /proc/pagetypeinfo, other system resource information is
also collected at appropriate times, and resource usage is tracked
throughout the process lifecycle as well. User-space management
components combine this information to make decisions and take
appropriate actions.
> > Under high load scenarios, reading /proc/pagetypeinfo causes memory
> > management components or memory allocation/free paths to be blocked
> > for extended periods waiting for the zone lock, leading to the following
> > issues:
> > 1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on Qcom
> > 8750 platforms, reducing system real-time performance
> > 2. Memory management components being blocked for extended periods,
> > preventing rapid acquisition of memory fragmentation information for
> > critical memory management decisions and actions
> > 3. Increased latency in memory allocation and free paths due to prolonged
> > zone lock contention
>
> It could be argued that not capturing /proc/pagetypeinfo (often) would help.
> I wonder if we can find also other benefits from the counters in the kernel
> itself.
Collecting system and app resource statistics and making decisions based
on this information is a common practice among Android device manufacturers.
There are likely over a billion Android phones in daily use worldwide.
The diversity of hardware configurations across Android devices makes it
difficult for kernel mechanisms alone to maintain good performance across
all usage scenarios.
First, hardware capabilities vary greatly - flagship phones may have up to
24GB of memory, while low-end devices may have as little as 4GB. CPU,
storage, battery, and passive cooling capabilities vary significantly due
to market positioning and cost factors. Hardware resources always seem
inadequate.
Second, usage scenarios also differ - some people use devices in hot
environments while others use them in cold ones; some enjoy
high-definition gaming while others simply browse the web.
Third, user habits vary as well. Some people rarely restart their phones
except when the battery dies or the system crashes; others restart daily,
like me. Some users never actively close apps, only switching them to
the background, resulting in dozens of apps running in the background and
keeping system resources (especially memory) tied up. Yet others just use
a few apps, closing unused apps rather than leaving them in the
background.
Despite the above challenges, Android device manufacturers hope to ensure
a good user experience (no UI jank) across all situations.
Even at 60 Hz frame refresh rate (90 Hz, 120 Hz also supported now), all
work from user input to render and display should be done within 16.7 ms.
To achieve this goal, the management components perform tasks such as:
- Track system resource status: what system has
(system resource awareness)
- Learn and predict app resource demands: what app needs
(resource demand awareness)
- Monitor app launch, exit, and foreground-background switches: least
important app gives back resource to system to serve most important
one, usually the foreground app
(user intent awareness)
Tracking system resources is a necessity for Android devices, not an
option, so the related paths are not that cold on these devices.
All of the above is from the workload perspective. From the kernel
perspective, regardless of when or how frequently user-space tools read
statistical information, they should not significantly affect the
kernel's own efficiency. That is why I submitted this patch series to
make the read side of /proc/pagetypeinfo lock-free. It does introduce
overhead on the hot path, though; I would greatly appreciate it if we
could discuss how to improve that here.
> Adding these migratetype counters is something that wouldn't be even
> possible in the past, until the freelist migratetype hygiene was merged.
> So now it should be AFAIK possible, but it's still some overhead in
> relatively hot paths. I wonder if we even considered this before in the
> context of migratetype hygiene? Couldn't find anything quickly.
Yes, I initially wrote the code on an old kernel; at that time, I reused
set_pcppage_migratetype (also renamed) to cache the exact migratetype
list that the page block was on. After the freelist migratetype hygiene
patches were merged, I removed that logic.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-12-01 2:36 ` Hongru Zhang
@ 2025-12-01 17:01 ` Zi Yan
2025-12-02 2:42 ` Hongru Zhang
0 siblings, 1 reply; 20+ messages in thread
From: Zi Yan @ 2025-12-01 17:01 UTC (permalink / raw)
To: Hongru Zhang
Cc: vbabka, Liam.Howlett, akpm, axelrasmussen, david, hannes,
jackmanb, linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt,
surenb, weixugc, yuanchu, zhanghongru
On 30 Nov 2025, at 21:36, Hongru Zhang wrote:
>>> On mobile devices, some user-space memory management components check
>>> memory pressure and fragmentation status periodically or via PSI, and
>>> take actions such as killing processes or performing memory compaction
>>> based on this information.
>>
>> Hm /proc/buddyinfo could be enough to determine fragmentation? Also we have
>> in-kernel proactive compaction these days.
>
> In fact, besides /proc/pagetypeinfo, other system resource information is
> also collected at appropriate times, and resource usage throughout the
> process lifecycle is appropriately tracked as well. User-space management
> components integrate this information together to make decisions and
> perform proper actions.
>
>>> Under high load scenarios, reading /proc/pagetypeinfo causes memory
>>> management components or memory allocation/free paths to be blocked
>>> for extended periods waiting for the zone lock, leading to the following
>>> issues:
>>> 1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on Qcom
>>> 8750 platforms, reducing system real-time performance
>>> 2. Memory management components being blocked for extended periods,
>>> preventing rapid acquisition of memory fragmentation information for
>>> critical memory management decisions and actions
>>> 3. Increased latency in memory allocation and free paths due to prolonged
>>> zone lock contention
>>
>> It could be argued that not capturing /proc/pagetypeinfo (often) would help.
>> I wonder if we can find also other benefits from the counters in the kernel
>> itself.
>
> Collecting system and app resource statistics and making decisions based
> on this information is a common practice among Android device manufacturers.
>
> Currently, there should be over a billion Android phones being used daily
> worldwide. The diversity of hardware configurations across Android devices
> makes it difficult for kernel mechanisms alone to maintain good
> performance across all usage scenarios.
>
> First, hardware capabilities vary greatly - flagship phones may have up to
> 24GB of memory, while low-end devices may have as little as 4GB. CPU,
> storage, battery, and passive cooling capabilities vary significantly due
> to market positioning and cost factors. Hardware resources seem always
> inadequate.
>
> Second, usage scenarios also differ - some people use devices in hot
> environments while others in cold environments; some enjoy high-definition
> gaming while others simply browse the web.
>
> Third, user habits vary as well. Some people rarely restart their phones
> except when the battery dies or the system crashes; others restart daily,
> like me. Some users never actively close apps, only switching them to
> the background, resulting in dozens of apps running in the background and
> keeping system resources consumed (especially memory). Yet others just use
> a few apps, closing unused apps rather than leaving them in the
> background.
>
> Despite the above challenges, Android device manufacturers hope to ensure
> a good user experience (no UI jank) across all situations.
>
> Even at 60 Hz frame refresh rate (90 Hz, 120 Hz also supported now), all
> work from user input to render and display should be done within 16.7 ms.
> To achieve this goal, the management components perform tasks such as:
> - Track system resource status: what system has
> (system resource awareness)
> - Learn and predict app resource demands: what app needs
> (resource demand awareness)
> - Monitor app launch, exit, and foreground-background switches: least
> important app gives back resource to system to serve most important
> one, usually the foreground app
> (user intent awareness)
>
> Tracking system resources seems necessary for Android devices, not
> optional. So the related paths are not that cold on Android devices.
This is all good background information. But how does the userspace
monitor utilize pageblock migratetype information? Can you give a
concrete example?
Something like: when free_movable is low, background apps are killed to
provide more free pages? Or is the userspace monitor even trying to
attribute pageblock usage to individual apps by sampling
/proc/pagetypeinfo before and after an app launch?
Thanks.
>
> All the above are from workload perspective. From the kernel perspective,
> regardless of when or how frequently user-space tools read statistical
> information, they should not affect the kernel's own efficiency
> significantly. That's why I submit this patch series to make the read side
> of /proc/pagetypeinfo lock-free. But this does introduce overhead in hot
> path, I would greatly appreciate if we can discuss how to improve it here.
>
>> Adding these migratetype counters is something that wouldn't be even
>> possible in the past, until the freelist migratetype hygiene was merged.
>> So now it should be AFAIK possible, but it's still some overhead in
>> relatively hot paths. I wonder if we even considered this before in the
>> context of migratetype hygiene? Couldn't find anything quickly.
>
> Yes, I wrote the code on old kernel initially, at that time, I reused
> set_pcppage_migratetype (also renamed) to cache the exact migratetype
> list that the page block is on. After the freelist migratetype hygiene
> patches were merged, I removed that logic.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access
2025-12-01 17:01 ` Zi Yan
@ 2025-12-02 2:42 ` Hongru Zhang
0 siblings, 0 replies; 20+ messages in thread
From: Hongru Zhang @ 2025-12-02 2:42 UTC (permalink / raw)
To: ziy
Cc: Liam.Howlett, akpm, axelrasmussen, david, hannes, jackmanb,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, rppt, surenb,
vbabka, weixugc, yuanchu, zhanghongru06, zhanghongru
> > Despite the above challenges, Android device manufacturers hope to ensure
> > a good user experience (no UI jank) across all situations.
> >
> > Even at 60 Hz frame refresh rate (90 Hz, 120 Hz also supported now), all
> > work from user input to render and display should be done within 16.7 ms.
> > To achieve this goal, the management components perform tasks such as:
> > - Track system resource status: what system has
> > (system resource awareness)
> > - Learn and predict app resource demands: what app needs
> > (resource demand awareness)
> > - Monitor app launch, exit, and foreground-background switches: least
> > important app gives back resource to system to serve most important
> > one, usually the foreground app
> > (user intent awareness)
> >
> > Tracking system resources seems necessary for Android devices, not
> > optional. So the related paths are not that cold on Android devices.
>
> These are all good background information. But how does userspace monitor
> utilize pageblock migratetype information? Can you give a concrete example?
>
> Something like when free_movable is low, background apps is killed to
> provide more free pages? Or is userspace monitor even trying to attribute
> different pageblock usage to each app by monitoring /proc/pagetypeinfo
> before and after an app launch?
>
> Thanks.
AOSP:
https://android.googlesource.com/platform/frameworks/base/+/refs/heads/main/core/java/com/android/internal/app/procstats/ProcessStats.java#:~:text=public%20void%20updateFragmentation()
We have proprietary algorithms, but they are confidential. I cannot
describe them publicly, sorry about that.
^ permalink raw reply [flat|nested] 20+ messages in thread