From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>,
    Mel Gorman <mgorman@techsingularity.net>, Zi Yan <ziy@nvidia.com>,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/5] mm: reliable huge page allocator
Date: Thu, 13 Mar 2025 17:05:31 -0400
Message-ID: <20250313210647.1314586-1-hannes@cmpxchg.org>

This series makes changes to the allocator and reclaim/compaction code
to try harder to avoid fragmentation. As a result, this makes huge
page allocations cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

  https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                    before                    after
Hugealloc Time mean               52739.45 (   +0.00%)     28904.00 (  -45.19%)
Hugealloc Time stddev             56541.26 (   +0.00%)     33464.37 (  -40.81%)
Kbuild Real time                    197.47 (   +0.00%)       196.59 (   -0.44%)
Kbuild User time                   1240.49 (   +0.00%)      1231.67 (   -0.71%)
Kbuild System time                   70.08 (   +0.00%)        59.10 (  -15.45%)
THP fault alloc                   46727.07 (   +0.00%)     63223.67 (  +35.30%)
THP fault fallback                21910.60 (   +0.00%)      5412.47 (  -75.29%)
Direct compact fail                 195.80 (   +0.00%)        59.07 (  -69.48%)
Direct compact success                7.93 (   +0.00%)         2.80 (  -57.46%)
Direct compact success rate %         3.51 (   +0.00%)         3.99 (  +10.49%)
Compact daemon scanned migrate  3369601.27 (   +0.00%)   2267500.33 (  -32.71%)
Compact daemon scanned free     5075474.47 (   +0.00%)   2339773.00 (  -53.90%)
Compact direct scanned migrate   161787.27 (   +0.00%)     47659.93 (  -70.54%)
Compact direct scanned free      163467.53 (   +0.00%)     40729.67 (  -75.08%)
Compact total migrate scanned   3531388.53 (   +0.00%)   2315160.27 (  -34.44%)
Compact total free scanned      5238942.00 (   +0.00%)   2380502.67 (  -54.56%)
Alloc stall                        2371.07 (   +0.00%)       638.87 (  -73.02%)
Pages kswapd scanned            2160926.73 (   +0.00%)   4002186.33 (  +85.21%)
Pages kswapd reclaimed           533191.07 (   +0.00%)    718577.80 (  +34.77%)
Pages direct scanned             400450.33 (   +0.00%)    355172.73 (  -11.31%)
Pages direct reclaimed            94441.73 (   +0.00%)     31162.80 (  -67.00%)
Pages total scanned             2561377.07 (   +0.00%)   4357359.07 (  +70.12%)
Pages total reclaimed            627632.80 (   +0.00%)    749740.60 (  +19.46%)
Swap out                          47959.53 (   +0.00%)    110084.33 ( +129.53%)
Swap in                            7276.00 (   +0.00%)     24457.00 ( +236.10%)
File refaults                    138043.00 (   +0.00%)    188226.93 (  +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%. These
metrics also hold up over time, while the vanilla kernel sees a steady
downward trend in success rates with each subsequent run, owed to the
cumulative effects of fragmentation.
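
As a rough cross-check of the failure-rate claim, the two THP fault rows
can be turned into a fallback share per configuration. This is only a
sketch: it assumes that "THP fault alloc" plus "THP fault fallback"
approximates the total number of THP fault attempts, and the helper names
are made up for illustration.

```python
# Means over 15 runs, copied from the "THP fault alloc" and
# "THP fault fallback" rows of the results table.
BEFORE = {"alloc": 46727.07, "fallback": 21910.60}
AFTER = {"alloc": 63223.67, "fallback": 5412.47}


def fallback_share(counts):
    """Fraction of THP fault attempts that fell back to base pages.

    Assumption: alloc + fallback is treated as the number of attempts.
    """
    attempts = counts["alloc"] + counts["fallback"]
    return counts["fallback"] / attempts


def relative_change(before, after):
    """Signed relative change from before to after (-0.5 means halved)."""
    return (after - before) / before


if __name__ == "__main__":
    share_before = fallback_share(BEFORE)
    share_after = fallback_share(AFTER)
    print(f"fallback share before: {share_before:.1%}")
    print(f"fallback share after:  {share_after:.1%}")
    print(f"change in fallbacks:   "
          f"{relative_change(BEFORE['fallback'], AFTER['fallback']):+.1%}")
```

Under that assumption, roughly a third of THP faults fall back before the
series and under a tenth afterwards, which lines up with the 75% reduction
quoted above.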

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.
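
The ordering change can be illustrated with a toy model. This is not
kernel code and does not mirror the real allocator paths: ToyZone,
alloc_page() and the two block types are hypothetical names, and the
baseline branch (fall back to another type before reclaiming) is one
reading of the current behavior rather than something this cover letter
spells out.

```python
from dataclasses import dataclass, field

MOVABLE, UNMOVABLE = "movable", "unmovable"


@dataclass
class ToyZone:
    """Toy zone holding free page counts per block type (illustrative)."""
    free: dict = field(default_factory=lambda: {MOVABLE: 0, UNMOVABLE: 0})
    reclaimable: int = 0  # pages that reclaim/compaction could give back

    def take(self, mtype):
        """Allocate one page from blocks of the requested type only."""
        if self.free[mtype] > 0:
            self.free[mtype] -= 1
            return mtype
        return None

    def steal(self, mtype):
        """Fallback: take a page from a block of a different type."""
        for other, count in self.free.items():
            if other != mtype and count > 0:
                self.free[other] -= 1
                return other
        return None

    def reclaim_and_compact(self, mtype):
        """Pretend reclaim/compaction frees pages in blocks of mtype."""
        freed, self.reclaimable = self.reclaimable, 0
        self.free[mtype] += freed
        return freed > 0


def alloc_page(zone, mtype, defrag_mode):
    """Return the block type the page came from, or None on failure."""
    page = zone.take(mtype)
    if page is not None:
        return page
    if not defrag_mode:
        # Modeled baseline: pollute another block type before reclaiming.
        page = zone.steal(mtype)
        if page is not None:
            return page
    # With defrag_mode, the no-fragmentation rule holds across this step.
    if zone.reclaim_and_compact(mtype):
        page = zone.take(mtype)
        if page is not None:
            return page
    # Last-ditch fallback, kept to avoid memory deadlocks.
    return zone.steal(mtype)
```

In this model, an unmovable request against a zone with one free movable
page and one reclaimable page pollutes the movable block when defrag_mode
is off, and is served from reclaimed memory when it is on.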

Main differences to the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely, and make
  pageblock reclaim and compaction mandatory for all allocation
  contexts. This opens up a large dependency graph for compaction,
  possibly remaining sources of pollution, and the handling of
  low-memory situations, OOMs and deadlocks.

  This version uses only kswapd & kcompactd to pre-produce pageblocks,
  while still allowing last-ditch fallbacks to avoid memory deadlocks.

  The long-term goal remains converging on the version proposed in the
  RFC and its ~100% THP success rate. But this is reserved for future
  iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
  per-migratetype counters. This allowed making compaction more
  efficient, and the pre-compaction gap checks more precise, but again
  at the cost of complex changes in an already invasive series.

  This series simply uses a new vmstat counter to track the number of
  free pages in whole blocks to base reclaim/compaction goals on.

- The behavior is opt-in and can be toggled at runtime. The risk for
  regressions with any allocator change is sizable, and while many
  users care about huge pages, obviously not all do. A runtime knob is
  warranted to make the behavior optional and provide an escape hatch.
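
For the runtime knob, a small helper along these lines could be used to
inspect and flip the setting. It is a sketch under assumptions: the path
is the usual procfs location for a vm.* sysctl, the file only exists on
kernels carrying this series, and the value written (1 to enable, 0 to
disable) is assumed here rather than taken from this cover letter.
read_knob() and write_knob() are illustrative names.

```python
from pathlib import Path

# Usual procfs location of the vm.defrag_mode sysctl; absent on kernels
# that do not carry this series.
DEFRAG_MODE = Path("/proc/sys/vm/defrag_mode")


def read_knob(path=DEFRAG_MODE):
    """Return the current value as a string, or None if the knob is absent."""
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def write_knob(value, path=DEFRAG_MODE):
    """Write a new value and return the previous one.

    Needs root on a real system. Raises if the knob does not exist, so a
    caller never silently creates a stray file.
    """
    old = read_knob(path)
    if old is None:
        raise RuntimeError(f"{path} not present; kernel lacks the knob")
    path.write_text(f"{value}\n")
    return old


if __name__ == "__main__":
    current = read_knob()
    if current is None:
        print("vm.defrag_mode is not available on this kernel")
    else:
        print(f"vm.defrag_mode = {current}")
```

Returning the old value from write_knob() makes it easy to restore the
previous setting after a test run, which fits the escape-hatch role
described above.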

Based on today's akpm/mm-unstable.

Patches #1 and #2 are somewhat unrelated cleanups, but they touch the
same code and so are included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)