[RFC PATCH v4 0/9] mm: Hot page tracking and promotion infrastructure

From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
	<gourry@gourry.net>, <mgorman@techsingularity.net>,
	<mingo@redhat.com>, <peterz@infradead.org>,
	<raghavendra.kt@amd.com>, <riel@surriel.com>,
	<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
	<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
	<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
	<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
	<akpm@linux-foundation.org>, <david@redhat.com>,
	<byungchul@sk.com>, <kinseyho@google.com>,
	<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
	<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
	<shivankg@amd.com>, Bharata B Rao <bharata@amd.com>
Subject: [RFC PATCH v4 0/9] mm: Hot page tracking and promotion infrastructure
Date: Sat, 6 Dec 2025 15:44:14 +0530
Message-ID: <20251206101423.5004-1-bharata@amd.com>

Hi,

This is v4 of pghot, the page hotness tracking and promotion subsystem.

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) with the following goals:

- Unify hot page detection from multiple sources like hint faults, page table
  scans, hardware hints (IBS).
- Decouple detection from migration.
- Centralize promotion logic via per-lowertier-node kernel thread.
- Move migration rate limiting and associated logic in NUMAB=2 (current NUMA
  Balancing based hot page promotion) from scheduler to pghot sub-system to
  enable broader reuse.
  
Currently, multiple kernel subsystems detect page accesses independently.
This patchset consolidates accesses from these mechanisms by providing:

- A common API for reporting page accesses
- Shared infrastructure for tracking hotness at PFN granularity
- Per-lowertier-node kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks frequency, last access time and accessing node for each recorded
  access.
- These hotness parameters are maintained per PFN in an unsigned long
  variable within the existing mem_section data structure.
  Bits 0-31 are used to store nid, frequency and time.
  Bits 32-62 are unused now.
  Bit 63 is used to indicate the page is ready for migration.
- Classifies pages as hot based on configurable thresholds.
- Pages classified as hot are marked as ready for migration using the ready bit.
- Per-lowertier-node kmigrated threads periodically scan the PFNs of lower tier
  nodes, checking for the migration-ready bit to perform batched migrations.
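
The per-PFN record described above can be sketched in user-space C roughly
as below. Only "bits 0-31 hold nid, frequency and time" and "bit 63 is the
ready bit" come from this cover letter; the 10/3/19 split of the low 32
bits, the helper names and the threshold parameter are assumptions made for
illustration and are not necessarily what the patches use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field widths; they only need to add up to 32 bits. */
#define HOT_NID_BITS	10
#define HOT_FREQ_BITS	3
#define HOT_TIME_BITS	19
#define HOT_NID_SHIFT	0
#define HOT_FREQ_SHIFT	(HOT_NID_SHIFT + HOT_NID_BITS)
#define HOT_TIME_SHIFT	(HOT_FREQ_SHIFT + HOT_FREQ_BITS)
#define HOT_NID_MASK	((UINT64_C(1) << HOT_NID_BITS) - 1)
#define HOT_FREQ_MASK	((UINT64_C(1) << HOT_FREQ_BITS) - 1)
#define HOT_TIME_MASK	((UINT64_C(1) << HOT_TIME_BITS) - 1)
#define HOT_READY_BIT	(UINT64_C(1) << 63)	/* bit 63: ready for migration */

_Static_assert(HOT_NID_BITS + HOT_FREQ_BITS + HOT_TIME_BITS == 32,
	       "nid, frequency and time must fit in bits 0-31");

/* Pack the three fields into bits 0-31; frequency saturates, time wraps. */
static uint64_t hot_pack(unsigned int nid, unsigned int freq, uint32_t time)
{
	assert(nid <= HOT_NID_MASK);
	if (freq > HOT_FREQ_MASK)
		freq = HOT_FREQ_MASK;
	return ((uint64_t)nid << HOT_NID_SHIFT) |
	       ((uint64_t)freq << HOT_FREQ_SHIFT) |
	       (((uint64_t)time & HOT_TIME_MASK) << HOT_TIME_SHIFT);
}

static unsigned int hot_nid(uint64_t w)
{
	return (unsigned int)((w >> HOT_NID_SHIFT) & HOT_NID_MASK);
}

static unsigned int hot_freq(uint64_t w)
{
	return (unsigned int)((w >> HOT_FREQ_SHIFT) & HOT_FREQ_MASK);
}

static uint32_t hot_time(uint64_t w)
{
	return (uint32_t)((w >> HOT_TIME_SHIFT) & HOT_TIME_MASK);
}

static bool hot_ready(uint64_t w)
{
	return (w & HOT_READY_BIT) != 0;
}

/*
 * Record one access: bump the frequency, remember the accessing node and
 * the access time, and set the ready bit once the (assumed) frequency
 * threshold is reached.
 */
static uint64_t hot_record(uint64_t w, unsigned int nid, uint32_t now,
			   unsigned int threshold)
{
	uint64_t next = hot_pack(nid, hot_freq(w) + 1, now);

	if (hot_freq(next) >= threshold)
		next |= HOT_READY_BIT;
	return next;
}
```

With a threshold of 2, calling hot_record() twice on a zeroed word leaves
the frequency at 2 and the ready bit set, which is the state a scanner such
as kmigrated would look for.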

Four page hotness sources have been integrated with the pghot subsystem on
an experimental basis:

1. IBS
2. klruscand (based on MGLRU page table walks)
3. NUMA Balancing (mode 2).
4. folio_mark_accessed()

Changes in v4
=============
- Addition of folio_mark_accessed() as source to track and promote unmapped
  page cache pages.
- Per-section indicator for hotness based on which a section is taken up for
  scanning. This should considerably reduce the scanning effort by kmigrated.
  The LSB of the pointer used to store the hotness data for each section is
  reprovisioned as section hotness indicator.
- Added a file under admin-guide to document the usage of pghot sub-system.
- HWhint source IBS is under its own config option now.
- All vmstat counters are under appropriate config options now.
- Most tunables are moved to a dedicated debugfs dir.
- Some code cleanup.
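
The per-section hotness indicator above relies on pointer tagging: an array
of unsigned long is aligned to more than one byte, so bit 0 of its address
is always zero and can carry a flag. A minimal user-space sketch follows;
struct section and the helper names are hypothetical and only illustrate
the idea, not the code in the patches.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTION_HOT_FLAG	((uintptr_t)1)	/* LSB of the stored pointer */

/* Stand-in for the per-section state; hot_map holds a tagged pointer. */
struct section {
	uintptr_t hot_map;
};

/* Store the hotness array pointer; the hot flag starts out clear. */
static void section_set_map(struct section *s, unsigned long *map)
{
	s->hot_map = (uintptr_t)map;
}

/* Strip the flag to recover the real pointer. */
static unsigned long *section_map(const struct section *s)
{
	return (unsigned long *)(s->hot_map & ~SECTION_HOT_FLAG);
}

static void section_mark_hot(struct section *s)
{
	s->hot_map |= SECTION_HOT_FLAG;
}

static void section_clear_hot(struct section *s)
{
	s->hot_map &= ~SECTION_HOT_FLAG;
}

/* A scanner can skip the whole section when this returns false. */
static bool section_is_hot(const struct section *s)
{
	return (s->hot_map & SECTION_HOT_FLAG) != 0;
}
```

A reporter would call section_mark_hot() when it records an access in the
section, and the scanner would test section_is_hot() before walking the
PFNs and clear the flag once done.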

Results
=======
System details
--------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2 
  0:  10  32  50 
  1:  32  10  60 
  2:  255  255  10

Hotness sources
---------------
NUMAB0 - Without NUMA Balancing in base case and with no source enabled
	 in the patched case. No migrations occur.
NUMAB2 - Existing hot page promotion for the base case and
	 use of hint faults as source in the patched case.
pgtscan - Klruscand (MGLRU based PTE A bit scanning) source
hwhints - IBS as source
FMA - folio_mark_accessed()

==============================================================
Scenario 1 - Enough memory in toptier and hence only promotion
==============================================================
Microbenchmark details
----------------------
Multi-threaded application with 64 threads that access memory at 4K granularity
repetitively and randomly. The number of accesses per thread and the randomness
pattern for each thread are fixed beforehand. The accesses are divided into stores
and loads.

Benchmark threads run on Node 0, while memory is initially provisioned on
CXL node 2 before the accesses start. The two modes reported here are:

Mode 1: Regular 4K page accesses. The memory is provisioned on CXL node using
mmap(MAP_POPULATE). 50% loads and 50% stores.

Mode 2: mmapped file 4K accesses. The memory is provisioned on CXL node using
mmap(fd, MAP_POPULATE|MAP_SHARED). 100% loads.

Repetitive accesses result in lowertier pages becoming hot and kmigrated
detecting and migrating them. The benchmark score is the time taken to finish
the accesses in microseconds; the sooner it finishes, the better. All the
numbers shown below are averages of 3 runs.
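
The benchmark source is not part of this posting. As a rough sketch of the
access pattern described above (random 4K accesses with a fixed seed and a
load/store mix), one thread's loop could look like the following. The
xorshift PRNG, the function names and the caller-supplied buffer are all
assumptions; the real benchmark provisions its memory with mmap() on the
CXL node.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ	4096u

/* xorshift64: small deterministic PRNG, so the pattern is fixed by the seed. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Touch nr_accesses randomly chosen 4K pages of buf. Roughly store_pct
 * percent of the accesses are stores, the rest are loads. Returns the
 * number of stores performed.
 */
static uint64_t touch_pages(volatile unsigned char *buf, size_t nr_pages,
			    uint64_t nr_accesses, unsigned int store_pct,
			    uint64_t seed)
{
	uint64_t state = seed ? seed : 1;	/* xorshift state must be non-zero */
	uint64_t stores = 0;

	for (uint64_t i = 0; i < nr_accesses; i++) {
		uint64_t r = xorshift64(&state);
		volatile unsigned char *p = buf + (size_t)(r % nr_pages) * PAGE_SZ;

		if ((r >> 32) % 100 < store_pct) {
			*p = (unsigned char)i;	/* store */
			stores++;
		} else {
			(void)*p;		/* load */
		}
	}
	return stores;
}
```

Passing store_pct = 50 approximates the Mode 1 mix and store_pct = 0 the
loads-only Mode 2.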

Mode 1 - Time taken (microseconds, lower is better)
------------------------------------------------------
Source		Base		Patched		Change
------------------------------------------------------
NUMAB0		118,986,471	116,240,187	-2.3%
NUMAB2		104,025,651	105,636,591	+1.5%
pgtscan		NA		110,800,511	NA
hwhints		NA		100,442,082	NA
------------------------------------------------------

Mode 1 - Pages migrated (pgpromote_success)
---------------------------------------
Source		Base		Patched
---------------------------------------
NUMAB0		0		0
NUMAB2		2097152		2097152
pgtscan		NA		2097152
hwhints		NA		1232876
---------------------------------------

Mode 2 - Time taken (microseconds, lower is better)
------------------------------------------------------
Source		Base		Patched		Change
------------------------------------------------------
NUMAB0		113,352,595	110,053,021	-2.9%
NUMAB2		72,339,008	84,999,971	+17.5%
pgtscan		NA		66,189,266	NA
hwhints		NA		71,644,577	NA
------------------------------------------------------

Mode 2 - Pages migrated (pgpromote_success)
---------------------------------------
Source		Base		Patched
---------------------------------------
NUMAB0		0		0
NUMAB2		2097152		2095978
pgtscan		NA		1993077
hwhints		NA		2097129
---------------------------------------

==============================================================
Scenario 2 - Toptier memory overcommitted, promotion + demotion
==============================================================
Single threaded application that allocates memory on both DRAM and CXL nodes
using mmap(MAP_POPULATE). Every 1G region of allocated memory on CXL node is
accessed at 4K granularity randomly and repetitively to build up the notion
of hotness in the 1GB region that is under access. This should drive promotion.
For promotion to work successfully, the DRAM memory that has been provisioned
(and is not being accessed) should be demoted first. There is enough free
memory in the CXL node for demotions.

In summary, this benchmark creates memory pressure on the DRAM node and does
CXL memory accesses to drive both demotion and promotion.

The number of accesses is fixed and hence, the quicker the accessed pages
get promoted to DRAM, the sooner the benchmark is expected to finish.

DRAM-node			= 1
CXL-node			= 2
Initial DRAM alloc ratio	= 75%
Allocation-size			= 171798691840
Initial DRAM Alloc-size		= 128849018880
Initial CXL Alloc-size		= 42949672960
Hot-region-size			= 1073741824
Nr-regions			= 160
Nr-regions DRAM			= 120 (provisioned but not accessed)
Nr-hot-regions CXL		= 40
Access pattern			= random
Access granularity		= 4096
Delay b/n accesses		= 0
Load/store ratio		= 50l50s
THP used			= no
Nr accesses			= 42949672960
Nr repetitions			= 1024
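
The allocation sizes above follow from the region count, the region size
and the initial DRAM ratio. A small helper (hypothetical names, written
only to cross-check the arithmetic) reproduces them:

```c
#include <stdint.h>

#define GIB	(UINT64_C(1) << 30)

struct layout {
	uint64_t dram_regions;	/* provisioned on DRAM, not accessed */
	uint64_t cxl_regions;	/* hot regions on CXL */
	uint64_t total_bytes;
	uint64_t dram_bytes;
	uint64_t cxl_bytes;
};

/* Split nr_regions regions of region_size bytes by the DRAM percentage. */
static struct layout split_regions(uint64_t nr_regions, uint64_t region_size,
				   unsigned int dram_pct)
{
	struct layout l;

	l.dram_regions = nr_regions * dram_pct / 100;
	l.cxl_regions = nr_regions - l.dram_regions;
	l.total_bytes = nr_regions * region_size;
	l.dram_bytes = l.dram_regions * region_size;
	l.cxl_bytes = l.cxl_regions * region_size;
	return l;
}
```

split_regions(160, GIB, 75) gives 120 DRAM regions and 40 CXL regions, that
is 128849018880 and 42949672960 bytes out of 171798691840 in total.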

Time taken (microseconds, lower is better)
------------------------------------------------------
Source		Base		Patched		Change
------------------------------------------------------
NUMAB0		61,537,418	59,165,269	-3.8%
NUMAB2		62,070,563	63,087,940	+1.6%
pgtscan		NA		66,886,552	NA
hwhints		NA		63,354,394	NA
------------------------------------------------------

Pages migrated (pgpromote_success)
---------------------------------------
Source		Base		Patched
---------------------------------------
NUMAB0		0		0
NUMAB2		0		0
pgtscan		NA		6481483
hwhints		NA		304
---------------------------------------

===============================================================
Scenario 3 - Numbers from folio_mark_accessed() (FMA) as source
===============================================================
Single threaded microbenchmark that provisions a file of 2G size on
CXL node initially, runs on Node 0 and reads random file pages at
4K granularity iteratively and repetitively. The FMA source detects the
reads on unmapped page cache pages residing on the CXL node and marks
them for promotion.
------------------------------------------------------------------
			Base		Patched		Patched
					FMA source	FMA source
					Disabled	Enabled
------------------------------------------------------------------
Time taken(us)		96,511,260	119,332,436	82,807,865
pgpromote_success	0		0		524242
------------------------------------------------------------------

Results summary
===============
- The observations from v3 pretty much remain the same for 1st
  and 2nd scenarios.
- FMA source: Compared to the base kernel, the time taken to complete
  the file accesses decreases with promotion of file pages in the patched
  version. However, when the FMA source isn't enabled, the patched kernel
  shows a regression compared to base that needs to be investigated.

This v4 patchset applies on top of upstream commit 4941a17751c9 and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/pghot-rfcv4

v3: https://lore.kernel.org/linux-mm/20251110052343.208768-1-bharata@amd.com/
v2: https://lore.kernel.org/linux-mm/20250910144653.212066-1-bharata@amd.com/
v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

TODOs
=====
- Check if the page is still within the hotness time window when
  kmigrated gets to it.
- Bulk access reporting may be desirable for sources like IBS.
- Take care of memory hotplug for allocation/freeing of mem_section->hot_map.
- Currently I am defaulting to node 0 if the target NID isn't specified by the
  source. The best fallback target node may have to be determined dynamically.
- Provide compatibility alias for the sysctls moved from sched to pghot.
- Wider testing and benchmark coverage.
- Address Ying Huang's comment about merging migrate_misplaced_folio()
  and migrate_misplaced_folios_batch() and handling memcg stats
  counting correctly in the latter.


Bharata B Rao (6):
  mm: migrate: Allow misplaced migration without VMA too
  mm: Hot page tracking and promotion
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses
  mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking
  mm: pghot: Add folio_mark_accessed() as hotness source

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 Documentation/admin-guide/mm/pghot.txt |  64 +++
 arch/x86/events/amd/ibs.c              |  10 +
 arch/x86/include/asm/entry-common.h    |   3 +
 arch/x86/include/asm/hardirq.h         |   2 +
 arch/x86/include/asm/msr-index.h       |  16 +
 arch/x86/mm/Makefile                   |   1 +
 arch/x86/mm/ibs.c                      | 348 ++++++++++++++++
 include/linux/migrate.h                |   6 +
 include/linux/mmzone.h                 |  19 +
 include/linux/pghot.h                  |  87 ++++
 include/linux/vm_event_item.h          |  26 ++
 kernel/sched/debug.c                   |   1 -
 kernel/sched/fair.c                    | 152 +------
 mm/Kconfig                             |  32 ++
 mm/Makefile                            |   2 +
 mm/huge_memory.c                       |  26 +-
 mm/internal.h                          |   4 +
 mm/klruscand.c                         | 110 +++++
 mm/memory.c                            |  31 +-
 mm/migrate.c                           |  41 +-
 mm/mm_init.c                           |  10 +
 mm/pghot-debug.c                       | 187 +++++++++
 mm/pghot.c                             | 533 +++++++++++++++++++++++++
 mm/swap.c                              |   8 +
 mm/vmscan.c                            | 181 ++++++---
 mm/vmstat.c                            |  26 ++
 26 files changed, 1686 insertions(+), 240 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.txt
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/klruscand.c
 create mode 100644 mm/pghot-debug.c
 create mode 100644 mm/pghot.c

-- 
2.34.1
