From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>, Bharata B Rao <bharata@amd.com>
Subject: [RFC PATCH v4 0/9] mm: Hot page tracking and promotion infrastructure
Date: Sat, 6 Dec 2025 15:44:14 +0530
Message-ID: <20251206101423.5004-1-bharata@amd.com>
Hi,
This is v4 of pghot, the page hotness tracking and promotion subsystem.
The patchset introduces this new subsystem with the following goals:
- Unify hot page detection from multiple sources such as hint faults, page
table scans, and hardware hints (IBS).
- Decouple detection from migration.
- Centralize promotion logic in per-lowertier-node kernel threads.
- Move migration rate limiting and associated logic in NUMAB=2 (current NUMA
Balancing based hot page promotion) from the scheduler to the pghot subsystem
to enable broader reuse.
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:
- A common API for reporting page accesses
- Shared infrastructure for tracking hotness at PFN granularity
- Per-lowertier-node kernel threads for promoting pages.
Here is a brief summary of how this subsystem works:
- Tracks frequency, last access time and accessing node for each recorded
access.
- These hotness parameters are maintained per PFN in an unsigned long
variable within the existing mem_section data structure.
Bits 0-31 are used to store nid, frequency and time.
Bits 32-62 are unused now.
Bit 63 is used to indicate the page is ready for migration.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the ready bit.
- Per-lowertier-node kmigrated threads periodically scan the PFNs of lower tier
nodes, checking for the migration-ready bit to perform batched migrations.
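To make the per-PFN layout above concrete, here is a minimal userspace
sketch of packing and unpacking such a hotness word. The field widths
(10 bits nid, 3 bits frequency, 19 bits time) and all hot_* names are
assumptions made for illustration; the cover letter only states that bits
0-31 hold nid, frequency and time and that bit 63 is the ready bit. It uses
uint64_t instead of unsigned long so that it behaves the same on any host.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMED field widths (not taken from the patchset): only the overall
 * split (bits 0-31 for nid/frequency/time, bit 63 for "ready") is given.
 */
#define HOT_NID_BITS	10
#define HOT_FREQ_BITS	3
#define HOT_TIME_BITS	19	/* 10 + 3 + 19 = 32 bits */

#define HOT_FREQ_SHIFT	HOT_NID_BITS
#define HOT_TIME_SHIFT	(HOT_NID_BITS + HOT_FREQ_BITS)
#define HOT_READY	(UINT64_C(1) << 63)

#define HOT_MASK(bits)	((UINT64_C(1) << (bits)) - 1)

/* Pack nid, frequency and last access time; the ready bit starts clear. */
static uint64_t hot_pack(unsigned int nid, unsigned int freq, unsigned int last)
{
	return (nid & HOT_MASK(HOT_NID_BITS)) |
	       ((freq & HOT_MASK(HOT_FREQ_BITS)) << HOT_FREQ_SHIFT) |
	       ((last & HOT_MASK(HOT_TIME_BITS)) << HOT_TIME_SHIFT);
}

static unsigned int hot_nid(uint64_t w)
{
	return w & HOT_MASK(HOT_NID_BITS);
}

static unsigned int hot_freq(uint64_t w)
{
	return (w >> HOT_FREQ_SHIFT) & HOT_MASK(HOT_FREQ_BITS);
}

static unsigned int hot_time(uint64_t w)
{
	return (w >> HOT_TIME_SHIFT) & HOT_MASK(HOT_TIME_BITS);
}

static int hot_is_ready(uint64_t w)
{
	return (w & HOT_READY) != 0;
}

/* Count the entries with the ready bit set, as a scanning thread might. */
static size_t hot_count_ready(const uint64_t *map, size_t n)
{
	size_t i, ready = 0;

	for (i = 0; i < n; i++)
		if (hot_is_ready(map[i]))
			ready++;
	return ready;
}
```

Setting the ready bit is a plain "w |= HOT_READY" and leaves the low 32
bits untouched, so the recorded nid, frequency and time stay readable
after a page has been marked for migration.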
Four page hotness sources have been integrated with the pghot subsystem on
an experimental basis:
1. IBS
2. klruscand (based on MGLRU page table walks)
3. NUMA Balancing (mode 2).
4. folio_mark_accessed()
Changes in v4
=============
- Addition of folio_mark_accessed() as source to track and promote unmapped
page cache pages.
- Per-section indicator for hotness based on which a section is taken up for
scanning. This should considerably reduce the scanning effort by kmigrated.
The LSB of the pointer used to store the hotness data for each section is
reprovisioned as section hotness indicator.
- Added a file under admin-guide to document the usage of pghot sub-system.
- HWhint source IBS is under its own config option now.
- All vmstat counters are under appropriate config options now.
- Most tunables are moved to a dedicated debugfs dir.
- Some code cleanup.
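The per-section indicator relies on the usual low-bit pointer tagging
trick: a pointer to a naturally aligned array always has bit 0 clear, so
that bit can carry a flag. A minimal userspace sketch, in which struct
section_slot and the slot_* helpers are hypothetical names and not the
ones used in the patches:

```c
#include <stdint.h>

/* Bit 0 of a pointer to an aligned uint64_t array is always zero. */
#define SECTION_HOT	((uintptr_t)1)

/* Hypothetical stand-in for the per-section hotness map pointer. */
struct section_slot {
	uintptr_t tagged_map;	/* map address, with SECTION_HOT in bit 0 */
};

static void slot_init(struct section_slot *s, uint64_t *map)
{
	s->tagged_map = (uintptr_t)map;	/* starts out not hot */
}

/* Strip the tag bit before using the value as a pointer. */
static uint64_t *slot_map(const struct section_slot *s)
{
	return (uint64_t *)(s->tagged_map & ~SECTION_HOT);
}

static int slot_is_hot(const struct section_slot *s)
{
	return (s->tagged_map & SECTION_HOT) != 0;
}

static void slot_mark_hot(struct section_slot *s)
{
	s->tagged_map |= SECTION_HOT;
}

static void slot_clear_hot(struct section_slot *s)
{
	s->tagged_map &= ~SECTION_HOT;
}
```

A scanner can then test slot_is_hot() first and skip the per-PFN walk of
any section that has recorded no hot pages.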
Results
=======
System details
--------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2: 255 255  10
Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
use of hint faults as source in the patched case.
pgtscan - Klruscand (MGLRU based PTE A bit scanning) source
hwhints - IBS as source
FMA - folio_mark_accessed()
==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Microbenchmark details
----------------------
Multi-threaded application with 64 threads that access memory at 4K granularity
repeatedly and randomly. The number of accesses per thread and the randomness
pattern for each thread are fixed beforehand. The accesses are divided into stores
and loads.
Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 2 before the accesses start. The benchmark is run in the following
modes:
Mode 1: Regular 4K page accesses. The memory is provisioned on CXL node using
mmap(MAP_POPULATE). 50% loads and 50% stores.
Mode 2: mmapped file 4K accesses. The memory is provisioned on CXL node using
mmap(fd, MAP_POPULATE|MAP_SHARED). 100% loads.
Repeated accesses result in lowertier pages becoming hot and kmigrated
detecting and migrating them. The benchmark score is the time taken to finish
the accesses in microseconds; the sooner it finishes, the better. All the
numbers shown below are the average of 3 runs.
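For readers who want to reproduce the shape of this workload, here is a
single-threaded sketch of the Mode 1 access loop. It is not the benchmark
used for the numbers below: the function names, the xorshift generator and
the strict load/store alternation are assumptions, and placing the memory
on the CXL node is left to the caller (for example by running the program
under numactl with --cpunodebind=0 --membind=2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE_SZ	4096UL

/* xorshift64: small deterministic PRNG so the pattern is repeatable. */
static uint64_t next_rand(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	return *state = x;
}

/*
 * Touch nr_accesses randomly chosen 4K pages of a pre-populated anonymous
 * mapping, alternating loads and stores (50% each).
 * Returns the elapsed time in microseconds, or -1 on failure.
 */
static long long run_accesses(size_t bytes, size_t nr_accesses, uint64_t seed)
{
	size_t nr_pages = bytes / PAGE_SZ;
	uint64_t state = seed ? seed : 1;	/* state must be nonzero */
	volatile unsigned char *mem;
	unsigned char sink = 0;
	struct timespec t0, t1;
	void *p;
	size_t i;

	if (nr_pages == 0)
		return -1;

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	mem = p;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nr_accesses; i++) {
		size_t page = next_rand(&state) % nr_pages;

		if (i & 1)
			mem[page * PAGE_SZ] = (unsigned char)i;	/* store */
		else
			sink ^= mem[page * PAGE_SZ];		/* load */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	(void)sink;
	munmap(p, bytes);
	return (t1.tv_sec - t0.tv_sec) * 1000000LL +
	       (t1.tv_nsec - t0.tv_nsec) / 1000;
}
```

The volatile pointer keeps the compiler from dropping the loads. The real
benchmark runs 64 such threads, each with its own fixed pattern.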
Mode 1 - Time taken (microseconds, lower is better)
------------------------------------------------------
Source      Base            Patched         Change
------------------------------------------------------
NUMAB0      118,986,471     116,240,187     -2.3%
NUMAB2      104,025,651     105,636,591     +1.5%
pgtscan     NA              110,800,511     NA
hwhints     NA              100,442,082     NA
------------------------------------------------------
Mode 1 - Pages migrated (pgpromote_success)
---------------------------------------
Source      Base        Patched
---------------------------------------
NUMAB0      0           0
NUMAB2      2097152     2097152
pgtscan     NA          2097152
hwhints     NA          1232876
---------------------------------------
Mode 2 - Time taken (microseconds, lower is better)
------------------------------------------------------
Source      Base            Patched         Change
------------------------------------------------------
NUMAB0      113,352,595     110,053,021     -2.9%
NUMAB2      72,339,008      84,999,971      +17.5%
pgtscan     NA              66,189,266      NA
hwhints     NA              71,644,577      NA
------------------------------------------------------
Mode 2 - Pages migrated (pgpromote_success)
---------------------------------------
Source      Base        Patched
---------------------------------------
NUMAB0      0           0
NUMAB2      2097152     2095978
pgtscan     NA          1993077
hwhints     NA          2097129
---------------------------------------
===============================================================
Scenario 2 - Toptier memory overcommitted, promotion + demotion
===============================================================
Single threaded application that allocates memory on both DRAM and CXL nodes
using mmap(MAP_POPULATE). Every 1G region of allocated memory on CXL node is
accessed at 4K granularity randomly and repetitively to build up the notion
of hotness in the 1GB region that is under access. This should drive promotion.
For promotion to work successfully, the DRAM memory that has been provisioned
(and is not being accessed) should be demoted first. There is enough free
memory in the CXL node for demotions.
In summary, this benchmark creates memory pressure on the DRAM node and does
CXL memory accesses to drive both demotion and promotion.
The number of accesses is fixed and hence, the quicker the accessed pages
get promoted to DRAM, the sooner the benchmark is expected to finish.
DRAM-node = 1
CXL-node = 2
Initial DRAM alloc ratio = 75%
Allocation-size = 171798691840
Initial DRAM Alloc-size = 128849018880
Initial CXL Alloc-size = 42949672960
Hot-region-size = 1073741824
Nr-regions = 160
Nr-regions DRAM = 120 (provisioned but not accessed)
Nr-hot-regions CXL = 40
Access pattern = random
Access granularity = 4096
Delay b/n accesses = 0
Load/store ratio = 50l50s
THP used = no
Nr accesses = 42949672960
Nr repetitions = 1024
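The size and region parameters above follow from three inputs: the total
allocation size, the initial DRAM ratio and the hot region size. A small
sketch of that arithmetic, where struct s2_layout and s2_compute() are
illustrative names only:

```c
#include <stdint.h>

struct s2_layout {
	uint64_t dram_bytes;		/* initially allocated on DRAM */
	uint64_t cxl_bytes;		/* initially allocated on CXL */
	uint64_t nr_regions;		/* total hot-region-sized chunks */
	uint64_t nr_dram_regions;	/* provisioned but not accessed */
	uint64_t nr_cxl_regions;	/* accessed, candidates for promotion */
};

/* Split alloc_bytes between DRAM and CXL and count the regions in each. */
static struct s2_layout s2_compute(uint64_t alloc_bytes, unsigned int dram_pct,
				   uint64_t region_bytes)
{
	struct s2_layout l;

	/* Multiply before dividing so the ratio does not lose precision. */
	l.dram_bytes = alloc_bytes * dram_pct / 100;
	l.cxl_bytes = alloc_bytes - l.dram_bytes;
	l.nr_regions = alloc_bytes / region_bytes;
	l.nr_dram_regions = l.dram_bytes / region_bytes;
	l.nr_cxl_regions = l.cxl_bytes / region_bytes;
	return l;
}
```

With alloc_bytes = 171798691840 (160 GiB), dram_pct = 75 and region_bytes =
1073741824 (1 GiB), this gives 128849018880 bytes on DRAM, 42949672960
bytes on CXL, and 160 regions split as 120 DRAM and 40 CXL, matching the
list above.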
Time taken (microseconds, lower is better)
------------------------------------------------------
Source      Base            Patched         Change
------------------------------------------------------
NUMAB0      61,537,418      59,165,269      -3.8%
NUMAB2      62,070,563      63,087,940      +1.6%
pgtscan     NA              66,886,552      NA
hwhints     NA              63,354,394      NA
------------------------------------------------------
Pages migrated (pgpromote_success)
---------------------------------------
Source      Base        Patched
---------------------------------------
NUMAB0      0           0
NUMAB2      0           0
pgtscan     NA          6481483
hwhints     NA          304
---------------------------------------
===============================================================
Scenario 3 - Numbers from folio_mark_accessed() (FMA) as source
===============================================================
Single threaded microbenchmark that initially provisions a 2G file on
the CXL node, runs on Node 0 and repeatedly reads random file pages at
4K granularity. The FMA source detects the reads on unmapped page cache
pages residing on the CXL node and marks them for promotion.
------------------------------------------------------------------
                     Base          Patched        Patched
                                   FMA source     FMA source
                                   Disabled       Enabled
------------------------------------------------------------------
Time taken (us)      96,511,260    119,332,436    82,807,865
pgpromote_success    0             0              524242
------------------------------------------------------------------
Results summary
===============
- The observations from v3 pretty much remain the same for 1st
and 2nd scenarios.
- FMA source: Compared to the base kernel, the time taken to complete
the file accesses decreases with promotion of file pages in the patched
version. However, when the FMA source isn't enabled, we see a regression
compared to base that needs to be investigated.
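The Scenario 3 table has no Change column. Assuming the same convention as
the Change columns in Scenarios 1 and 2, that is (Patched - Base) / Base,
the relative change can be computed with a helper like the one below
(pct_change is an illustrative name):

```c
/* Relative change of patched vs. base in percent; negative means faster. */
static double pct_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

Applied to the Scenario 3 times, this gives roughly -14.2% with the FMA
source enabled (82,807,865 vs 96,511,260) and roughly +23.6% with it
disabled (119,332,436 vs 96,511,260).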
This v4 patchset applies on top of upstream commit 4941a17751c9 and
can be fetched from:
https://github.com/AMDESE/linux-mm/tree/bharata/pghot-rfcv4
v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/
TODOs
=====
- Check if the page is still within the hotness time window when
kmigrated gets to it.
- Bulk access reporting may be desirable for sources like IBS.
- Take care of memory hotplug for allocation/freeing of mem_section->hot_map.
- Currently I am defaulting to node 0 if the target NID isn't specified by the
source. The best fallback target node may have to be determined dynamically.
- Provide compatibility alias for the sysctls moved from sched to pghot.
- Wider testing and benchmark coverage.
- Address Ying Huang's comment about merging migrate_misplaced_folio()
and migrate_misplaced_folios_batch() and handling memcg stats counting
correctly in the latter.
Bharata B Rao (6):
mm: migrate: Allow misplaced migration without VMA too
mm: Hot page tracking and promotion
x86: ibs: In-kernel IBS driver for memory access profiling
x86: ibs: Enable IBS profiling for memory accesses
mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking
mm: pghot: Add folio_mark_accessed() as hotness source
Gregory Price (1):
migrate: implement migrate_misplaced_folios_batch
Kinsey Ho (2):
mm: mglru: generalize page table walk
mm: klruscand: use mglru scanning for page promotion
Documentation/admin-guide/mm/pghot.txt | 64 +++
arch/x86/events/amd/ibs.c | 10 +
arch/x86/include/asm/entry-common.h | 3 +
arch/x86/include/asm/hardirq.h | 2 +
arch/x86/include/asm/msr-index.h | 16 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/ibs.c | 348 ++++++++++++++++
include/linux/migrate.h | 6 +
include/linux/mmzone.h | 19 +
include/linux/pghot.h | 87 ++++
include/linux/vm_event_item.h | 26 ++
kernel/sched/debug.c | 1 -
kernel/sched/fair.c | 152 +------
mm/Kconfig | 32 ++
mm/Makefile | 2 +
mm/huge_memory.c | 26 +-
mm/internal.h | 4 +
mm/klruscand.c | 110 +++++
mm/memory.c | 31 +-
mm/migrate.c | 41 +-
mm/mm_init.c | 10 +
mm/pghot-debug.c | 187 +++++++++
mm/pghot.c | 533 +++++++++++++++++++++++++
mm/swap.c | 8 +
mm/vmscan.c | 181 ++++++---
mm/vmstat.c | 26 ++
26 files changed, 1686 insertions(+), 240 deletions(-)
create mode 100644 Documentation/admin-guide/mm/pghot.txt
create mode 100644 arch/x86/mm/ibs.c
create mode 100644 include/linux/pghot.h
create mode 100644 mm/klruscand.c
create mode 100644 mm/pghot-debug.c
create mode 100644 mm/pghot.c
--
2.34.1