linux-mm.kvack.org archive mirror
From: Hongru Zhang <zhanghongru06@gmail.com>
To: 21cnbao@gmail.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
	axelrasmussen@google.com, david@kernel.org, hannes@cmpxchg.org,
	jackmanb@google.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com, mhocko@suse.com,
	rppt@kernel.org, surenb@google.com, vbabka@suse.cz,
	weixugc@google.com, yuanchu@google.com, zhanghongru06@gmail.com,
	zhanghongru@xiaomi.com, ziy@nvidia.com
Subject: Re: [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts
Date: Tue,  3 Mar 2026 16:04:20 +0800	[thread overview]
Message-ID: <20260303080423.472534-1-zhanghongru@xiaomi.com> (raw)
In-Reply-To: <CAGsJ_4wCeLr6KOTU=Pc4ALeq5x-i0C7i6C3cSddexHw2ADSnng@mail.gmail.com>

> On Sat, Nov 29, 2025 at 8:04 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Nov 28, 2025 at 11:13 AM Hongru Zhang <zhanghongru06@gmail.com> wrote:
> > >
> > > From: Hongru Zhang <zhanghongru@xiaomi.com>
> > >
> > > Using per-migratetype counts instead of list_empty() helps save a
> > > few CPU instructions.
> > >
> > > Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
> > > ---
> > >  mm/internal.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 1561fc2ff5b8..7759f8fdf445 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> > >
> > >  static inline bool free_area_empty(struct free_area *area, int migratetype)
> > >  {
> > > -       return list_empty(&area->free_list[migratetype]);
> > > +       return !READ_ONCE(area->mt_nr_free[migratetype]);
> >
> > I'm not quite sure about this. Since the counter is written and read more
> > frequently, cache coherence traffic may actually be higher than for the list
> > head.
> >
> > I'd prefer to drop this unless there is real data showing it performs better.
>
> If the goal is to optimize free_area list checks and list_add,
> a reasonable approach is to organize the data structure
> to reduce false sharing between different mt and order entries.
>
> struct mt_free_area {
>         struct list_head        free_list;
>         unsigned long           nr_free;
> } ____cacheline_aligned;
>
> struct free_area {
>         struct mt_free_area     mt_free_area[MIGRATE_TYPES];
> };
>
> However, without supporting data, it’s unclear if the space increase
> is justified :-)

I designed a test model that deliberately triggers false sharing and collected
data under it to see which layout performs better.

Test model
- Based on the microbench that was removed from mmtests
  Commit: beeaeb89 ("pagealloc: Remove bit-rotted benchmark")
- Goal: Generate concurrent kernel page alloc/free activity across multiple
  orders and migratetypes to observe cacheline sharing and contention in the
  buddy free_area
- Mechanism: A systemtap module exposes a write-only
  /proc/mmtests-pagealloc-micro. Writing a 64-bit encoded value triggers
  repeated page alloc/free in kernel space
  - bits 7:0   -> mt (0=UNMOVABLE, 1=MOVABLE, 2=RECLAIMABLE)
  - bits 15:8  -> order
  - bits 63:16 -> batch
- Workload distribution:
  - order = cpu % 4 (orders 0/1/2/3)
  - mt    = cpu % 3 (UNMOVABLE/MOVABLE/RECLAIMABLE)
  - cpu0 and cpu1 are not used for the test
- Sampling:
  - load the stap module
  - determine the encoded value according to cpu id and bind the workload to
    that cpu
  - after a short delay, run 'perf mem record' for 100s
  - unload the stap module
- Test tool:
  - https://gist.github.com/zhr250/72e56f87ac703e833b11b5341d616cb0
- Data analysis tool:
  - https://gist.github.com/zhr250/f4a385ffa9fae2993d22748f31e18588

CPU topo info of my machine:
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (25MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
      L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
      L2 L#8 (2048KB)
        L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#16 (P#16)
        L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#17 (P#17)
        L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10 + PU L#18 (P#18)
        L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11 + PU L#19 (P#19)

Actual (order, mt) distribution on my machine:
  order=0, mt=0: cpu12
  order=0, mt=1: cpu4, cpu16
  order=0, mt=2: cpu8
  order=1, mt=0: cpu9
  order=1, mt=1: cpu13
  order=1, mt=2: cpu5, cpu17
  order=2, mt=0: cpu6, cpu18
  order=2, mt=1: cpu10
  order=2, mt=2: cpu2, cpu14
  order=3, mt=0: cpu3, cpu15
  order=3, mt=1: cpu7, cpu19
  order=3, mt=2: cpu11

Different migratetype/order combinations are placed on CPUs that do not share
L1/L2 caches to maximize cacheline contention. For our test goal, I think this
distribution is relatively reasonable. I ran 10 rounds for each kernel and
found the data to be stable.

Layouts tested (capturing load/store samples in free_area[0..MAX_PAGE_ORDER]):
- vanilla kernel:
    struct free_area
    {
            struct list_head free_list[MIGRATE_TYPES];
            unsigned long    nr_free;
    };
- patched kernel:
    struct free_area {
            struct list_head free_list[MIGRATE_TYPES];
            unsigned long    nr_free;
   +	    unsigned long    mt_nr_free[MIGRATE_TYPES];
    };
- mtlist kernel:
   +struct mt_free_list {
   +        struct list_head list;
   +        unsigned long    nr_free;
   +};
   +
    struct free_area {
   -        struct list_head    free_list[MIGRATE_TYPES];
   +        struct mt_free_list	mt_free_list[MIGRATE_TYPES];
            unsigned long       nr_free;
    };

Summary:
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
|  Kernel | inrange samples |     HitM (%)    | L1 hit inc LFB/MAB (%) |   L2 hit (%)  | L3 hit inc HitM (%) |  RAM hit (%)  |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| vanilla |     192,468     | 45,421 (23.60%) |     94,486 (49.09%)    | 1,952 (1.01%) |   91,240 (47.41%)   | 4,790 (2.49%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| patched |     227,196     | 27,293 (12.01%) |    165,238 (72.73%)    | 1,194 (0.53%) |   54,609 (24.04%)   | 6,155 (2.71%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
|  mtlist |     240,694     | 50,911 (21.15%) |    132,827 (55.19%)    | 3,165 (1.31%) |   98,556 (40.95%)   | 6,146 (2.55%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+

Detailed data:
- https://gist.github.com/zhr250/2ccf8902080ecaf85477d9c051e72a96

For both the L1 hit rate and the HitM rate, the patched kernel is the best of
the three.

In this test model, I also collected memory allocation counts. The patched
kernel delivers the best performance: about 7.00% more allocations than
vanilla and 4.93% more than mtlist.

Detailed data:
https://gist.github.com/zhr250/4439523b7ca3c18f4a2d2c97b24c4965


Thread overview: 22+ messages
2025-11-28  3:10 [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Hongru Zhang
2025-11-28  3:11 ` [PATCH 1/3] mm/page_alloc: add per-migratetype counts to buddy allocator Hongru Zhang
2025-11-29  0:34   ` Barry Song
2025-11-28  3:12 ` [PATCH 2/3] mm/vmstat: get fragmentation statistics from per-migragetype count Hongru Zhang
2025-11-28 12:03   ` zhongjinji
2025-11-29  0:00     ` Barry Song
2025-11-29  7:55       ` Barry Song
2025-12-01 12:29       ` Hongru Zhang
2025-12-01 18:54         ` Barry Song
2025-11-28  3:12 ` [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts Hongru Zhang
2025-11-29  0:04   ` Barry Song
2025-11-29  9:24     ` Barry Song
2026-03-03  8:04       ` Hongru Zhang [this message]
2026-03-03  8:29     ` Hongru Zhang
2025-11-28  7:49 ` [PATCH 0/3] mm: add per-migratetype counts to buddy allocator and optimize pagetypeinfo access Lorenzo Stoakes
2025-11-28  8:34   ` Hongru Zhang
2025-11-28  8:40     ` Lorenzo Stoakes
2025-11-28  9:24 ` Vlastimil Babka
2025-11-28 13:08   ` Johannes Weiner
2025-12-01  2:36   ` Hongru Zhang
2025-12-01 17:01     ` Zi Yan
2025-12-02  2:42       ` Hongru Zhang
