* [PATCH 0/5] mm: reliable huge page allocator
@ 2025-03-13 21:05 Johannes Weiner
From: Johannes Weiner @ 2025-03-13 21:05 UTC
  To: Andrew Morton; +Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm, linux-kernel

This series makes changes to the allocator and reclaim/compaction code
to try harder to avoid fragmentation. As a result, this makes huge
page allocations cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

  https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                                     before                   after
    Hugealloc Time mean               52739.45 (    +0.00%)   28904.00 (   -45.19%)
    Hugealloc Time stddev             56541.26 (    +0.00%)   33464.37 (   -40.81%)
    Kbuild Real time                    197.47 (    +0.00%)     196.59 (    -0.44%)
    Kbuild User time                   1240.49 (    +0.00%)    1231.67 (    -0.71%)
    Kbuild System time                   70.08 (    +0.00%)      59.10 (   -15.45%)
    THP fault alloc                   46727.07 (    +0.00%)   63223.67 (   +35.30%)
    THP fault fallback                21910.60 (    +0.00%)    5412.47 (   -75.29%)
    Direct compact fail                 195.80 (    +0.00%)      59.07 (   -69.48%)
    Direct compact success                7.93 (    +0.00%)       2.80 (   -57.46%)
    Direct compact success rate %         3.51 (    +0.00%)       3.99 (   +10.49%)
    Compact daemon scanned migrate  3369601.27 (    +0.00%) 2267500.33 (   -32.71%)
    Compact daemon scanned free     5075474.47 (    +0.00%) 2339773.00 (   -53.90%)
    Compact direct scanned migrate   161787.27 (    +0.00%)   47659.93 (   -70.54%)
    Compact direct scanned free      163467.53 (    +0.00%)   40729.67 (   -75.08%)
    Compact total migrate scanned   3531388.53 (    +0.00%) 2315160.27 (   -34.44%)
    Compact total free scanned      5238942.00 (    +0.00%) 2380502.67 (   -54.56%)
    Alloc stall                        2371.07 (    +0.00%)     638.87 (   -73.02%)
    Pages kswapd scanned            2160926.73 (    +0.00%) 4002186.33 (   +85.21%)
    Pages kswapd reclaimed           533191.07 (    +0.00%)  718577.80 (   +34.77%)
    Pages direct scanned             400450.33 (    +0.00%)  355172.73 (   -11.31%)
    Pages direct reclaimed            94441.73 (    +0.00%)   31162.80 (   -67.00%)
    Pages total scanned             2561377.07 (    +0.00%) 4357359.07 (   +70.12%)
    Pages total reclaimed            627632.80 (    +0.00%)  749740.60 (   +19.46%)
    Swap out                          47959.53 (    +0.00%)  110084.33 (  +129.53%)
    Swap in                            7276.00 (    +0.00%)   24457.00 (  +236.10%)
    File refaults                    138043.00 (    +0.00%)  188226.93 (   +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%. These
metrics also hold up over time, while the vanilla kernel sees a steady
downward trend in success rates with each subsequent run, owing to the
cumulative effects of fragmentation.
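
The failure-rate claim can be cross-checked against the table. The
sketch below is only an illustration: it assumes that every THP fault
either allocates a huge page ("THP fault alloc") or falls back to base
pages ("THP fault fallback"), and it defines the failure rate as
fallbacks divided by the sum of both. The function names are
hypothetical and are not part of the series.

```c
#include <stdio.h>

/*
 * Share of THP faults that fell back to base pages.
 * Assumption: total THP faults = fault_alloc + fault_fallback.
 */
double thp_fallback_rate(double fault_alloc, double fault_fallback)
{
	double total = fault_alloc + fault_fallback;

	return total > 0.0 ? fault_fallback / total : 0.0;
}

/* Relative change from before to after, in percent. */
double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the derived rates for the per-run means reported in the table. */
void print_thp_fallback_summary(void)
{
	double before = thp_fallback_rate(46727.07, 21910.60);
	double after = thp_fallback_rate(63223.67, 5412.47);

	printf("fallback rate before: %.1f%%\n", before * 100.0);
	printf("fallback rate after:  %.1f%%\n", after * 100.0);
	printf("relative change:      %.1f%%\n", pct_change(before, after));
}
```

Under that definition the fallback rate drops from roughly 32% of THP
faults to roughly 8%, a relative reduction of about 75%, which is in
line with the -75.29% change in the raw fallback count.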

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.
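
For illustration, a runtime sysctl like this could be flipped from a
small userspace helper. This is a sketch under assumptions, not part of
the series: the procfs path follows the usual mapping of vm.<name> to
/proc/sys/vm/<name>, the value 1 is assumed to enable the mode and 0 to
disable it, and the helper names are hypothetical.

```c
#include <errno.h>
#include <stdio.h>

/* Assumed path: sysctl vm.defrag_mode maps to this procfs file. */
#define DEFRAG_MODE_PATH "/proc/sys/vm/defrag_mode"

/*
 * Write an integer to a sysctl file. Returns 0 on success or a negative
 * errno value on failure. The path is a parameter so the helper can be
 * pointed at a scratch file on kernels without the knob.
 */
int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", value) < 0)
		err = -EIO;
	/* The write may only reach the kernel at flush time; check fclose(). */
	if (fclose(f) != 0 && !err)
		err = -errno;
	return err;
}

/* Read the integer back. Returns 0 on success, negative errno on failure. */
int read_sysctl_int(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int err = 0;

	if (!f)
		return -errno;
	if (fscanf(f, "%d", value) != 1)
		err = -EINVAL;
	fclose(f);
	return err;
}
```

Calling write_sysctl_int(DEFRAG_MODE_PATH, 1) as root would then be the
equivalent of "sysctl -w vm.defrag_mode=1", assuming the knob accepts
that value on the running kernel.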

Main differences from the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely, and make
  pageblock reclaim and compaction mandatory for all allocation
  contexts. This opens up a large dependency graph for compaction,
  possibly remaining sources of pollution, and the handling of
  low-memory situations, OOMs and deadlocks.

  This version uses only kswapd & kcompactd to pre-produce pageblocks,
  while still allowing last-ditch fallbacks to avoid memory deadlocks.

  The long-term goal remains converging on the version proposed in the
  RFC and its ~100% THP success rate. But this is reserved for future
  iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
  per-migratetype counters. This allowed making compaction more
  efficient, and the pre-compaction gap checks more precise, but again
  at the cost of complex changes in an already invasive series.

  This series simply uses a new vmstat counter to track the number of
  free pages in whole blocks to base reclaim/compaction goals on.

- The behavior is opt-in and can be toggled at runtime. The risk for
  regressions with any allocator change is sizable, and while many
  users care about huge pages, obviously not all do. A runtime knob is
  warranted to make the behavior optional and provide an escape hatch.
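
The vmstat counter mentioned above can be observed from userspace. The
sketch below parses /proc/vmstat-style text ("name value" per line) and
looks up one counter by name. The cover letter does not name the new
counter, so the name is left as a parameter; "nr_free_pages_blocks" in
the usage note is an assumption to verify against /proc/vmstat on a
patched kernel. The helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the counter is absent.
 */
int vmstat_lookup(FILE *f, const char *name, unsigned long long *value)
{
	char key[128];
	unsigned long long v;

	while (fscanf(f, "%127s %llu", key, &v) == 2) {
		if (strcmp(key, name) == 0) {
			*value = v;
			return 0;
		}
	}
	return -1;
}

/* Convenience wrapper that opens a file such as /proc/vmstat. */
int vmstat_read(const char *path, const char *name, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = vmstat_lookup(f, name, value);
	fclose(f);
	return ret;
}
```

For example, vmstat_read("/proc/vmstat", "nr_free_pages_blocks", &v)
could be compared against "nr_free_pages" to see how much of the free
memory currently sits in whole free pageblocks, assuming that counter
name exists on the running kernel.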

Based on today's akpm/mm-unstable.

Patches #1 and #2 are somewhat unrelated cleanups, but they touch the
same code and so are included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)



Thread overview: 32+ messages
2025-03-13 21:05 [PATCH 0/5] mm: reliable huge page allocator Johannes Weiner
2025-03-13 21:05 ` [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers Johannes Weiner
2025-03-14 15:08   ` Zi Yan
2025-03-16  4:28   ` Hugh Dickins
2025-03-17 18:18     ` Johannes Weiner
2025-03-21  6:21   ` kernel test robot
2025-03-21 13:55     ` Johannes Weiner
2025-04-10 15:19   ` Vlastimil Babka
2025-04-10 20:17     ` Johannes Weiner
2025-04-11  7:32       ` Vlastimil Babka
2025-03-13 21:05 ` [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing Johannes Weiner
2025-03-14 18:36   ` Zi Yan
2025-03-13 21:05 ` [PATCH 3/5] mm: page_alloc: defrag_mode Johannes Weiner
2025-03-14 18:54   ` Zi Yan
2025-03-14 20:50     ` Johannes Weiner
2025-03-14 22:54       ` Zi Yan
2025-03-22 15:05   ` Brendan Jackman
2025-03-23  0:58     ` Johannes Weiner
2025-03-23  1:34       ` Johannes Weiner
2025-03-23  3:46         ` Johannes Weiner
2025-03-23 18:04           ` Brendan Jackman
2025-03-31 15:55             ` Johannes Weiner
2025-03-13 21:05 ` [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance Johannes Weiner
2025-03-13 21:05 ` [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks Johannes Weiner
2025-03-14 21:05   ` Johannes Weiner
2025-04-11  8:19   ` Vlastimil Babka
2025-04-11 15:39     ` Johannes Weiner
2025-04-11 16:51       ` Vlastimil Babka
2025-04-11 18:21         ` Johannes Weiner
2025-04-13  2:20           ` Johannes Weiner
2025-04-15  7:31             ` Vlastimil Babka
2025-04-15  7:44             ` Vlastimil Babka
