[PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
@ 2026-04-18 12:02 Barry Song (Xiaomi)
From: Barry Song (Xiaomi) @ 2026-04-18 12:02 UTC
To: akpm, linux-mm
Cc: linux-kernel, Barry Song (Xiaomi),
Lance Yang, Xueyuan Chen, Kairui Song, Qi Zheng, Shakeel Butt,
wangzicheng, Suren Baghdasaryan, Lei Liu, Matthew Wilcox,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
MGLRU gives high priority to folios mapped in page tables.
As a result, folio_set_active() is invoked for all folios
added to the LRU during page faults. In practice, however,
readahead can bring in many folios that are never accessed
via page tables.
A previous attempt by Lei Liu proposed introducing a separate
LRU for readahead[1] to make readahead pages easier to reclaim,
but that approach is likely over-engineered.
Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
protection"), folios with PG_active were always placed in
the youngest generation, leading to over-protection and
increased refaults. After that commit, PG_active folios
are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In
contrast, the classic active/inactive scheme is more
conservative.
This patch switches to folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs,
the folio will be promoted based on the reference flag
set by folio_mark_accessed().
The following uses a simple model to demonstrate why the current
code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
strided pattern (4KB out of every 64KB) to simulate prefaulted pages
that may never be accessed.
#!/bin/bash
CG_NAME="mglru_verify_test"
CG_PATH="/sys/fs/cgroup/$CG_NAME"
MEM_LIMIT="400M"
HOT_SIZE="600M"
# 1. Environment Setup
sudo rmdir "$CG_PATH" 2>/dev/null
sudo mkdir -p "$CG_PATH"
sudo chown -R $USER:$USER "$CG_PATH"
echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
# 2. Prepare Data Files
dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# 3. Start Workload (Working Set)
(
echo $BASHPID > "$CG_PATH/cgroup.procs"
exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
--zonemode=strided --zonesize=4K --zonerange=64K \
--time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
--fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
) &
WORKLOAD_PID=$!
# 4. Waiting for hot data to warm up
sleep 30
BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
# 5. Run workload for 60 seconds
sleep 60
# 6. Report refault and IO bandwidth
FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
echo "File Refault Delta is $FINAL_D_FILE"
kill $WORKLOAD_PID 2>/dev/null
sleep 2
grep -E "READ|WRITE" fio.stats \
| awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
rm -f hot_data.bin fio.stats
Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s: prefaulted but unused pages occupy hot LRU
positions and continuously push out the real working set, so reclaim
keeps evicting the wrong pages. With the patch, we observed 0
refaults and bandwidth increased to 5078 MiB/s.
Note that this patch does not benefit any platform other than arm64,
since commit 315d09bf30c2 ("Revert "mm: make faultaround produce old
ptes"") reverted the change that made prefault PTEs "old", after it
was identified as the cause of a ~6% regression in UnixBench on x86.
x86 was reported to handle the hardware access flag through an
internal microfault mechanism, which is relatively expensive: not
marking prefaulted PTEs young directly in the page fault path cost
~6% in UnixBench, especially when UnixBench runs without any memory
pressure[2].
Thanks to Will for raising this for arm64: "Create 'old' PTEs for
faultaround mappings on arm64 with hardware access flag" [3].
Credit also goes to arm64 microarchitectures, which incur zero cost
for hardware access flag handling.
It may be time for x86 and other architectures to revisit
whether HW AF is truly costly on their platforms, given that
the original x86 regression was reported 10 years ago.
For those who want to try the model on x86, you will need the
following in arch/x86/include/asm/pgtable.h.
#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
return true;
}
Lance and Xueyuan made a huge contribution to this patch
through testing. They truly worked over weekends and after
work hours. If this patch deserves any credit, it belongs to
them.
[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/
[2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
[3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@kernel.org/
Tested-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: wangzicheng <wangzicheng@honor.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lei Liu <liulei.rjpt@vivo.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
The earlier RFC was:
[PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/
mm/swap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..e3cf703ccb89 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
/* see the comment in lru_gen_folio_seq() */
if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
+ folio_mark_accessed(folio);
folio_batch_add_and_move(folio, lru_add);
}
--
2.39.3 (Apple Git-146)