From: Hongru Zhang <zhanghongru06@gmail.com>
To: 21cnbao@gmail.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
axelrasmussen@google.com, david@kernel.org, hannes@cmpxchg.org,
jackmanb@google.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, lorenzo.stoakes@oracle.com, mhocko@suse.com,
rppt@kernel.org, surenb@google.com, vbabka@suse.cz,
weixugc@google.com, yuanchu@google.com, zhanghongru06@gmail.com,
zhanghongru@xiaomi.com, ziy@nvidia.com
Subject: Re: [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts
Date: Tue, 3 Mar 2026 16:04:20 +0800 [thread overview]
Message-ID: <20260303080423.472534-1-zhanghongru@xiaomi.com> (raw)
In-Reply-To: <CAGsJ_4wCeLr6KOTU=Pc4ALeq5x-i0C7i6C3cSddexHw2ADSnng@mail.gmail.com>
> On Sat, Nov 29, 2025 at 8:04 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Nov 28, 2025 at 11:13 AM Hongru Zhang <zhanghongru06@gmail.com> wrote:
> > >
> > > From: Hongru Zhang <zhanghongru@xiaomi.com>
> > >
> > > Using per-migratetype counts instead of list_empty() helps save a
> > > few CPU instructions.
> > >
> > > Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
> > > ---
> > > mm/internal.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 1561fc2ff5b8..7759f8fdf445 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> > >
> > > static inline bool free_area_empty(struct free_area *area, int migratetype)
> > > {
> > > - return list_empty(&area->free_list[migratetype]);
> > > + return !READ_ONCE(area->mt_nr_free[migratetype]);
> >
> > I'm not quite sure about this. Since the counter is written and read more
> > frequently, cache coherence traffic may actually be higher than for the list
> > head.
> >
> > I'd prefer to drop this unless there is real data showing it performs better.
>
> If the goal is to optimize free_area list checks and list_add,
> a reasonable approach is to organize the data structure
> to reduce false sharing between different mt and order entries.
>
> struct mt_free_area {
> struct list_head free_list;
> unsigned long nr_free;
> } ____cacheline_aligned;
>
> struct free_area {
> struct mt_free_area mt_free_area[MIGRATE_TYPES];
> };
>
> However, without supporting data, it’s unclear if the space increase
> is justified :-)
I designed a test model that deliberately triggers false sharing and collected
data under it to see which layout performs better.
Test model:
- Based on the microbench that was removed from mmtests
Commit: beeaeb89 ("pagealloc: Remove bit-rotted benchmark")
- Goal: Generate concurrent kernel page alloc/free activity across multiple
orders and migratetypes to observe cacheline sharing and contention in the
buddy free_area
- Mechanism: A systemtap module exposes a write-only
/proc/mmtests-pagealloc-micro. Writing a 64-bit encoded value triggers
repeated page alloc/free in kernel space
- bits 7:0 -> mt (0=UNMOVABLE, 1=MOVABLE, 2=RECLAIMABLE)
- bits 15:8 -> order
- bits 63:16 -> batch
- Workload distribution:
- order = cpu % 4 (orders 0/1/2/3)
- mt = cpu % 3 (UNMOVABLE/MOVABLE/RECLAIMABLE)
- cpu0 and cpu1 are not used for the test
- Sampling:
- load stap
- determine the encoded value from the CPU id and bind the workload to that CPU
- after a short delay, run 'perf mem record' for 100s
- unload stap
- Test tool:
- https://gist.github.com/zhr250/72e56f87ac703e833b11b5341d616cb0
- Data analysis tool:
- https://gist.github.com/zhr250/f4a385ffa9fae2993d22748f31e18588
CPU topo info of my machine:
Package L#0
NUMANode L#0 (P#0 15GB)
L3 L#0 (25MB)
L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#2)
PU L#3 (P#3)
L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#4)
PU L#5 (P#5)
L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#6)
PU L#7 (P#7)
L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#8)
PU L#9 (P#9)
L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#10)
PU L#11 (P#11)
L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#12)
PU L#13 (P#13)
L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#14)
PU L#15 (P#15)
L2 L#8 (2048KB)
L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#16 (P#16)
L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#17 (P#17)
L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10 + PU L#18 (P#18)
L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11 + PU L#19 (P#19)
Actual (order, mt) distribution on my machine:
order=0, mt=0: cpu12
order=0, mt=1: cpu4, cpu16
order=0, mt=2: cpu8
order=1, mt=0: cpu9
order=1, mt=1: cpu13
order=1, mt=2: cpu5, cpu17
order=2, mt=0: cpu6, cpu18
order=2, mt=1: cpu10
order=2, mt=2: cpu2, cpu14
order=3, mt=0: cpu3, cpu15
order=3, mt=1: cpu7, cpu19
order=3, mt=2: cpu11
Different migratetype/order combinations are placed on CPUs that do not share
L1/L2 caches to maximize cacheline contention. For our test goal, I think this
distribution is relatively reasonable. I ran 10 rounds for each kernel and
found the data to be stable.
Layouts tested (capturing load/store samples in free_area[0..MAX_PAGE_ORDER]):
- vanilla kernel:
struct free_area
{
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
- patched kernel:
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
+ unsigned long mt_nr_free[MIGRATE_TYPES];
};
- mtlist kernel:
+struct mt_free_list {
+ struct list_head list;
+ unsigned long nr_free;
+};
+
struct free_area {
- struct list_head free_list[MIGRATE_TYPES];
+ struct mt_free_list mt_free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
Summary:
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| Kernel | inrange samples | HitM (%) | L1 hit inc LFB/MAB (%) | L2 hit (%) | L3 hit inc HitM (%) | RAM hit (%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| vanilla | 192,468 | 45,421 (23.60%) | 94,486 (49.09%) | 1,952 (1.01%) | 91,240 (47.41%) | 4,790 (2.49%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| patched | 227,196 | 27,293 (12.01%) | 165,238 (72.73%) | 1,194 (0.53%) | 54,609 (24.04%) | 6,155 (2.71%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| mtlist | 240,694 | 50,911 (21.15%) | 132,827 (55.19%) | 3,165 (1.31%) | 98,556 (40.95%) | 6,146 (2.55%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
Detailed data:
- https://gist.github.com/zhr250/2ccf8902080ecaf85477d9c051e72a96
For both L1 hit and HitM, the patched kernel is the best among the three.
In this test model, I also collected memory allocation counts. The patched
kernel delivers the best performance: about 7.00% higher than vanilla and
4.93% higher than mtlist.
Detailed data:
https://gist.github.com/zhr250/4439523b7ca3c18f4a2d2c97b24c4965