* [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
@ 2025-09-10 14:46 Bharata B Rao
From: Bharata B Rao @ 2025-09-10 14:46 UTC
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
	peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
	ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
	david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, Bharata B Rao

Hi,

This patchset introduces a new subsystem for hot page tracking
and promotion (pghot) that consolidates memory access information
from various sources and enables centralized promotion of hot
pages across memory tiers.

Currently, multiple kernel subsystems detect page accesses
independently. For example:

- NUMA Balancing via hint faults
- MGLRU via page table scanning for the PTE Accessed (A) bit

This patchset consolidates the access information from these
mechanisms by providing a common API for reporting page accesses,
a shared infrastructure for tracking hotness at PFN granularity,
and per-node kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks frequency, last access time and accessing node as
  part of each access record.
- Maintains per-PFN access records in hash lists.
- Classifies pages as hot based on configurable thresholds.
- Uses per-toptier-node max-heaps to prioritize hot pages for promotion.
- Launches per-toptier-node kpromoted threads to perform batched
  migrations.
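
To make the first three points concrete, here is a minimal user-space
sketch of per-PFN access records kept in hash lists. It is not the code
from this series: the names (struct hot_rec, record_access(),
rec_is_hot()), the table size, the hash function and the fixed
frequency threshold are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024u        /* hypothetical table size (2^10 buckets) */
#define HOT_FREQ_THRESHOLD 2u /* hypothetical: accesses needed to be hot */

/* One record per reported PFN (illustrative layout, not the patch's). */
struct hot_rec {
	uint64_t pfn;
	int nid;            /* node that last accessed the page */
	uint32_t freq;      /* number of reported accesses */
	uint64_t last_time; /* time of the most recent access */
	struct hot_rec *next;
};

static struct hot_rec *buckets[NBUCKETS];

static unsigned int pfn_hash(uint64_t pfn)
{
	/* Multiplicative hash; the top 10 bits select one of 1024 buckets. */
	return (unsigned int)((pfn * 0x9E3779B97F4A7C15ull) >> 54);
}

/* Report one access; returns the record, or NULL if allocation fails. */
static struct hot_rec *record_access(uint64_t pfn, int nid, uint64_t now)
{
	unsigned int b = pfn_hash(pfn);
	struct hot_rec *r;

	assert(b < NBUCKETS);
	for (r = buckets[b]; r; r = r->next)
		if (r->pfn == pfn)
			break;
	if (!r) {
		/* First report for this PFN: add a new record to the list. */
		r = calloc(1, sizeof(*r));
		if (!r)
			return NULL;
		r->pfn = pfn;
		r->next = buckets[b];
		buckets[b] = r;
	}
	/* Repeated reports update the existing record in place. */
	r->freq++;
	r->nid = nid;
	r->last_time = now;
	return r;
}

/* Classify a record as hot using a simple frequency threshold. */
static int rec_is_hot(const struct hot_rec *r)
{
	return r->freq >= HOT_FREQ_THRESHOLD;
}
```

A real implementation would also need locking, a bounded allocator
suitable for atomic context and a time window for the frequency count;
the sketch leaves all of that out.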

When different subsystems report page accesses via the API
introduced by this new subsystem, a record for each such page
is stored in hash lists (hashed by PFN value). In addition to
the PFN and target_nid, the hotness record includes parameters
such as the frequency and time of access, from which the hotness
is derived. Repeated reports of access to the same PFN update
the hotness information in its record. When the hotness of a
record (as updated during reporting of access) crosses a threshold,
the record is added to a max-heap data structure. Records
in the max heap are ordered by hotness, and hence
the top elements of the heap correspond to the hottest
pages. There is one such heap for each toptier node so
that the per-toptier-node kpromoted thread can easily extract the
top N records from its own heap and perform batched migration.
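
The heap side of this flow could look roughly like the following
user-space sketch of an array-backed max-heap with a "take top N"
helper. Everything in it is hypothetical (struct hot_heap, heap_push(),
heap_pop(), heap_take_batch(), the fixed capacity and a single integer
hotness value); the series has its own heap code, and this sketch only
illustrates the ordering and batched extraction described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_CAP 64 /* hypothetical fixed capacity for the sketch */

struct heap_ent {
	uint64_t pfn;
	uint32_t hotness; /* larger means hotter */
};

struct hot_heap {
	struct heap_ent e[HEAP_CAP];
	size_t n;
};

static void swap_ent(struct heap_ent *a, struct heap_ent *b)
{
	struct heap_ent t = *a;

	*a = *b;
	*b = t;
}

/* Insert a record; returns 0 on success, -1 if the heap is full. */
static int heap_push(struct hot_heap *h, uint64_t pfn, uint32_t hotness)
{
	size_t i;

	if (h->n == HEAP_CAP)
		return -1;
	i = h->n++;
	h->e[i].pfn = pfn;
	h->e[i].hotness = hotness;
	/* Sift up while the new entry is hotter than its parent. */
	while (i > 0 && h->e[(i - 1) / 2].hotness < h->e[i].hotness) {
		swap_ent(&h->e[(i - 1) / 2], &h->e[i]);
		i = (i - 1) / 2;
	}
	return 0;
}

/* Remove and return the hottest record; the heap must not be empty. */
static struct heap_ent heap_pop(struct hot_heap *h)
{
	struct heap_ent top;
	size_t i = 0;

	assert(h->n > 0);
	top = h->e[0];
	h->e[0] = h->e[--h->n];
	/* Sift down: swap with the hotter child until order is restored. */
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->n && h->e[l].hotness > h->e[m].hotness)
			m = l;
		if (r < h->n && h->e[r].hotness > h->e[m].hotness)
			m = r;
		if (m == i)
			break;
		swap_ent(&h->e[i], &h->e[m]);
		i = m;
	}
	return top;
}

/* Pull up to max hottest PFNs into batch[]; returns how many were taken. */
static size_t heap_take_batch(struct hot_heap *h, uint64_t *batch, size_t max)
{
	size_t k = 0;

	while (k < max && h->n > 0)
		batch[k++] = heap_pop(h).pfn;
	return k;
}
```

In this picture, a promotion thread would call heap_take_batch() on the
heap of its own node and hand the resulting PFNs to a batched migration
routine.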

Three page hotness sources have been integrated with the pghot
subsystem on an experimental basis:

1. IBS
2. klruscand (based on MGLRU page table walks)
3. NUMA Balancing (mode 2).

Changes in v2
=============
- Moved migration rate-limiting and dynamic threshold logic from
  the NUMA Balancing subsystem to pghot. With this, the logic to
  classify a page as hot more closely resembles the existing
  mechanism.
- Converted NUMA Balancing mode 2 to just detect accesses through
  NUMA hint faults and delegate the rest of the processing (hot page
  classification and promotion) to pghot.
- Packed the three parameters required for hot page tracking
  (nid, frequency and timestamp) into a single u32 for space
  efficiency.
- Misc cleanups and refactoring.
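
As an illustration of the u32 packing mentioned above, the following
sketch stores nid, frequency and a truncated timestamp in one 32-bit
word. The field widths (10/3/19 bits), the field order and all helper
names are assumptions chosen for the example, not the layout used in
the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths; they must add up to 32 bits. */
#define NID_BITS   10u
#define FREQ_BITS   3u
#define TIME_BITS  19u

#define NID_MASK   ((1u << NID_BITS) - 1)
#define FREQ_MASK  ((1u << FREQ_BITS) - 1)
#define TIME_MASK  ((1u << TIME_BITS) - 1)

#define FREQ_SHIFT NID_BITS
#define TIME_SHIFT (NID_BITS + FREQ_BITS)

_Static_assert(NID_BITS + FREQ_BITS + TIME_BITS == 32,
	       "fields must fill a u32 exactly");

/* Pack the three fields; freq saturates, time keeps its low bits. */
static uint32_t hot_pack(uint32_t nid, uint32_t freq, uint32_t time)
{
	assert(nid <= NID_MASK);
	if (freq > FREQ_MASK)
		freq = FREQ_MASK;
	return nid | (freq << FREQ_SHIFT) | ((time & TIME_MASK) << TIME_SHIFT);
}

static uint32_t hot_nid(uint32_t v)
{
	return v & NID_MASK;
}

static uint32_t hot_freq(uint32_t v)
{
	return (v >> FREQ_SHIFT) & FREQ_MASK;
}

static uint32_t hot_time(uint32_t v)
{
	return v >> TIME_SHIFT;
}

/* Elapsed time between two truncated timestamps, modulo wraparound. */
static uint32_t hot_time_delta(uint32_t now, uint32_t then)
{
	return (now - then) & TIME_MASK;
}
```

The trade-off in any such layout is that a narrow frequency field
saturates quickly and a truncated timestamp wraps, so comparisons have
to be done modulo the field width as in hot_time_delta().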

This v2 patchset applies on top of upstream commit 8742b2d8935f and
can be fetched from:
https://github.com/AMDESE/linux-mm/tree/bharata/kpromoted-rfcv2

v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

TODOs
=====
- Memory allocation: High volume of allocations and frees (millions)
  from atomic context needs evaluation.
- Memory overhead: The amount of data needed for tracking hotness is
  also a concern.
- Integrate Kscand[1], the PTE A bit based approach that Raghavendra KT
  is working on, so that Kscand acts as a temperature source and
  uses pghot for hot page heuristics and promotion.
- Heap pruning: Consider adding a heap pruning mechanism for periodic
  cleaning of cold records.
- Address Ying Huang's comment about merging migrate_misplaced_folio()
  and migrate_misplaced_folios_batch() and handling memcg stats
  counting correctly in the latter.
- Testing: Light functional testing done; performance benchmarking and
  stress testing will follow in the next iterations.

Any feedback is welcome!

Bharata B Rao (5):
  mm: migrate: Allow misplaced migration without VMA too
  mm: Hot page tracking and promotion
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses
  mm: sched: Move hot page promotion from NUMAB=2 to kpromoted

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 +
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 343 +++++++++++++++
 include/linux/migrate.h             |   6 +
 include/linux/mmzone.h              |  16 +
 include/linux/pghot.h               |  98 +++++
 include/linux/vm_event_item.h       |  26 ++
 kernel/sched/fair.c                 | 149 +------
 mm/Kconfig                          |  19 +
 mm/Makefile                         |   2 +
 mm/internal.h                       |   4 +
 mm/klruscand.c                      | 118 +++++
 mm/memory.c                         |  32 +-
 mm/migrate.c                        |  36 +-
 mm/mm_init.c                        |  10 +
 mm/pghot.c                          | 648 ++++++++++++++++++++++++++++
 mm/vmscan.c                         | 176 ++++++--
 mm/vmstat.c                         |  26 ++
 22 files changed, 1535 insertions(+), 218 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/klruscand.c
 create mode 100644 mm/pghot.c

[1] Kscand - https://lore.kernel.org/linux-mm/20250814153307.1553061-1-raghavendra.kt@amd.com/
-- 
2.34.1




Thread overview: 53+ messages
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36   ` Jonathan Cameron
2025-10-03 11:02     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17   ` Jonathan Cameron
2025-10-06  4:13     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19   ` Jonathan Cameron
2025-10-06  4:28     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22   ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30   ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38   ` Jonathan Cameron
2025-10-06  5:57     ` Bharata B Rao
2025-10-06  9:53       ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01   ` Gregory Price
2025-09-16 19:45     ` David Rientjes
2025-09-16 22:02       ` Gregory Price
2025-09-17  0:30       ` Wei Xu
2025-09-17  3:20         ` Balbir Singh
2025-09-17  4:15           ` Bharata B Rao
2025-09-17 16:49         ` Jonathan Cameron
2025-09-25 14:03           ` Yiannis Nikolakopoulos
2025-09-25 14:41             ` Gregory Price
2025-10-16 11:48               ` Yiannis Nikolakopoulos
2025-09-25 15:00             ` Jonathan Cameron
2025-09-25 15:08               ` Gregory Price
2025-09-25 15:18                 ` Gregory Price
2025-09-25 15:24                 ` Jonathan Cameron
2025-09-25 16:06                   ` Gregory Price
2025-09-25 17:23                     ` Jonathan Cameron
2025-09-25 19:02                       ` Gregory Price
2025-10-01  7:22                         ` Gregory Price
2025-10-17  9:53                           ` Yiannis Nikolakopoulos
2025-10-17 14:15                             ` Gregory Price
2025-10-17 14:36                               ` Jonathan Cameron
2025-10-17 14:59                                 ` Gregory Price
2025-10-20 14:05                                   ` Jonathan Cameron
2025-10-21 18:52                                     ` Gregory Price
2025-10-21 18:57                                       ` Gregory Price
2025-10-22  9:09                                         ` Jonathan Cameron
2025-10-22 15:05                                           ` Gregory Price
2025-10-23 15:29                                             ` Jonathan Cameron
2025-10-16 16:16               ` Yiannis Nikolakopoulos
2025-10-20 14:23                 ` Jonathan Cameron
2025-10-20 15:05                   ` Gregory Price
2025-10-08 17:59       ` Vinicius Petrucci
