> > The result is as follows:
> >
> > Time_ms Nr_iteration_total Skip_addr_out_of_range Skip_mm_mismatch
> > Before: 228.65 22169 22168 0
> > After : 0.396 3 0 2
> >
> > The referenced reproducer of rmap_walk_ksm can be found at:
> > https://lore.kernel.org/all/20260206151424734QIyWL_pA-1QeJPbJlUxsO@zte.com.cn/
> >
> > Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
> > Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
> > Signed-off-by: xu xin <xu.xin16@zte.com.cn>
>
> This is a very attractive speedup, but I believe it's flawed: in the
> special case when a range has been mremap-moved, its anon folio
> indexes and anon_vma pgoff correspond to the original user address,
> not to the current user address.
>
> In which case, rmap_walk_ksm() will be unable to find all the PTEs
> for that KSM folio, which will consequently be pinned in memory -
> unable to be reclaimed, unable to be migrated, unable to be hotremoved,
> until it's finally unmapped or KSM disabled.
>
> But it's years since I worked on KSM or on anon_vma, so I may be confused
> and my belief wrong. I have tried to test it, and my testcase did appear
> to show 7.0-rc6 successfully swapping out even mremap-moved KSM folios,
> but mm.git failing to do so.
Thank you very much for providing such detailed historical context. However,
I'm curious about your test case: how did you observe that KSM pages in mm.git
could not be swapped out, while 7.0-rc6 worked fine?

In the current implementation of mremap, before the move succeeds it always
calls prep_move_vma() -> madvise(MADV_UNMERGEABLE) -> break_ksm(), which breaks
KSM pages back into regular anonymous pages. This appears to be based on a
patch you introduced over a decade ago, 1ff829957316 ("ksm: prevent mremap
move poisoning"). Given that, KSM pages should already have been broken before
the move, so they wouldn't remain as mergeable pages after mremap. Could there
be a scenario where this breaking mechanism is bypassed, or am I missing a
subtlety in the sequence of operations?

Thanks!
> However, I say "appear to show" because I
> found swapping out any KSM pages harder than I'd been expecting: so have
> some doubts about my testing. Let me give more detail on that at the
> bottom of this mail: it's a tangent which had better not distract from
> your speedup.
>
> If I'm right that your patch is flawed, what to do?
>
> Perhaps there is, or could be, a cleverer way for KSM to walk the anon_vma
> interval tree, which can handle the mremap-moved pgoffs appropriately.
> Cc'ing Michel, whose bf181b9f9d8d ("mm anon rmap: replace same_anon_vma
> linked list with an interval tree.") specifically chose the 0, ULONG_MAX
> which you are replacing.
>
> Cc'ing Lorenzo, who is currently considering replacing anon_vma by
> something more like my anonmm, which preceded Andrea's anon_vma in 2.6.7;
> but Lorenzo supplementing it with the mremap tracking which defeated me.
> This rmap_walk_ksm() might well benefit from his approach. (I'm not
> actually expecting any input from Lorenzo here, or Michel: more FYIs.)
>
> But more realistic in the short term, might be for you to keep your
> optimization, but fix the lookup, by keeping a count of PTEs found
> and, when that falls short, taking a second pass with 0, ULONG_MAX.
> Somewhat ugly, certainly imperfect, but good enough for now.
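The two-pass fallback suggested above can be sketched in plain userspace C.
This is only a toy model, not the kernel's rmap code: `struct toy_vma` and
`struct toy_folio` are hypothetical stand-ins for `vm_area_struct` and
`folio`, and the linear scan stands in for the anon_vma interval tree. The
point is just the shape of the walk: try the narrow pgoff range derived from
the folio's index first, and only if that finds fewer mappings than expected
(as could happen if a vma's pgoffs no longer match after an mremap move)
retry with the full 0..ULONG_MAX range.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures: a "vma" here is just
 * a half-open pgoff interval, and a "folio" knows its index and how
 * many PTEs are expected to map it. */
struct toy_vma { unsigned long pgoff_start, pgoff_end; };
struct toy_folio { unsigned long index; int mapcount; };

/* Count vmas whose pgoff interval intersects [first, last]
 * (a linear scan standing in for an interval-tree walk). */
static int walk_range(const struct toy_vma *vmas, size_t n,
                      unsigned long first, unsigned long last)
{
    int found = 0;

    for (size_t i = 0; i < n; i++)
        if (vmas[i].pgoff_start <= last && vmas[i].pgoff_end > first)
            found++;
    return found;
}

/* Two-pass walk: first the narrow range implied by folio->index;
 * if that finds fewer mappings than the folio's mapcount (e.g. a vma
 * was mremap-moved so its pgoffs no longer correspond), fall back to
 * the full 0..ULONG_MAX range. */
static int walk_folio(const struct toy_vma *vmas, size_t n,
                      const struct toy_folio *folio)
{
    int found = walk_range(vmas, n, folio->index, folio->index);

    if (found < folio->mapcount)
        found = walk_range(vmas, n, 0, ULONG_MAX);
    return found;
}
```

The cost of the second pass is only paid when the cheap first pass comes up
short, which matches the "somewhat ugly, but good enough for now" framing
above.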