From: Kiryl Shutsemau <kas@kernel.org>
To: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	 Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	 Vlastimil Babka <vbabka@kernel.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	 Zi Yan <ziy@nvidia.com>, Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	Sean Christopherson <seanjc@google.com>,
	 Paolo Bonzini <pbonzini@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	kvm@vger.kernel.org,  James Houghton <jthoughton@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
Date: Tue, 14 Apr 2026 18:08:48 +0100
Message-ID: <ad5hAVuRwa_0VNPf@thinkstation>
In-Reply-To: <ad5dIUpAMs4MuBvV@x1.local>

On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> Hi, Kiryl,
> 
> On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> 
> Thanks for sharing this work, it looks very interesting to me.
> 
> Personally I am also looking at some kind of VMM memtiering issues.  I'm
> not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> facing, it's slightly different but still a bit relevant:
> 
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Thanks, will read up. I didn't follow userfaultfd work until recently.

> Unfortunately, that proposal was rejected upstream.

Sorry about that. We can chat about it in the hallway track, if you are there :)

> > == VMM Workflow ==
> 
> AFAIU, this workflow provides two functionalities:
> 
> > 
> >     UFFDIO_DEACTIVATE(all)            -- async, no vCPU stalls
> >     sleep(interval)
> >     PAGEMAP_SCAN                      -- find cold pages
> 
> Until here it's only about page hotness tracking.  I am curious whether you
> evaluated idle page tracking.  Is it because of perf overheads on rmap?

I didn't give idle page tracking much thought. I needed uffd faults to
serialize reclaim against memory accesses. If we use it for one thing, we
might as well try to use it for tracking too. And it seems to fit
together nicely with sync/async mode flipping.

> To
> me, your solution (until here.. on the hotness sampling) reads more like a
> more efficient way to do idle page tracking but only per-mm, not per-folio.
> 
> That will also be something I would like to benefit if QEMU will decide to
> do full userspace swap.  I think that's our last resort, I'll likely start
> with something that makes QEMU work together with Linux on swapping
> (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> currently uses, as long as efficient) then QEMU only cares about the rest,
> which is what the migration problem is about.
> 
> The other issue about idle page tracking to us is, I believe MGLRU
> currently doesn't work well with it (due to ignoring IDLE bits) where the
> old LRU algo works.  I'm not sure how much you evaluated above, so it'll be
> great to share from that perspective too.  I also mentioned some of these
> challenges in the lsfmm proposal link above.
> 
> >     UFFDIO_SET_MODE(sync)             -- block faults for eviction
> >     pwrite + MADV_DONTNEED cold pages -- safe, faults block
> >     UFFDIO_SET_MODE(async)            -- resume tracking
> 
> These operations are the 2nd function.  It's, IMHO, a full userspace swap
> system based on userfaultfd.

Right. And we want to decide from userspace where to put cold pages.
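
To make the quoted workflow concrete, here is a toy Python model of one
tracking/eviction cycle. It is only a sketch, not kernel or VMM code: the
ToyGuestMemory class, its methods and run_cycle() are hypothetical stand-ins
for UFFDIO_DEACTIVATE, PAGEMAP_SCAN, UFFDIO_SET_MODE and pwrite +
MADV_DONTNEED, and the dict used as a backing store is an assumption.

```python
class ToyGuestMemory:
    """Toy model of the workflow; every name here is hypothetical."""

    def __init__(self, num_pages):
        self.resident = {p: f"data{p}" for p in range(num_pages)}
        self.deactivated = set()  # pages marked by the deactivate step
        self.sync = False         # False = async tracking, True = sync
        self.blocked = []         # accesses held until the VMM resolves them
        self.store = {}           # backing store chosen by userspace

    def deactivate_all(self):
        # Stands in for UFFDIO_DEACTIVATE(all).
        self.deactivated = set(self.resident)

    def touch(self, page):
        """Guest access, read or write. Returns False if the access blocks."""
        if page not in self.resident:
            self.blocked.append(page)   # evicted page: assumed to need the VMM
            return False
        if page in self.deactivated:
            if self.sync:
                self.blocked.append(page)   # sync mode: fault waits
                return False
            self.deactivated.discard(page)  # async mode: resolved, page is hot
        return True

    def scan_cold(self):
        # Stands in for PAGEMAP_SCAN: pages still deactivated are cold.
        return sorted(self.deactivated)

    def evict(self, pages):
        # Stands in for pwrite + MADV_DONTNEED; only safe while faults block.
        assert self.sync, "evict only in sync mode"
        for page in pages:
            self.store[page] = self.resident.pop(page)
            self.deactivated.discard(page)


def run_cycle(mem, touched):
    mem.deactivate_all()
    for page in touched:       # guest accesses during sleep(interval)
        mem.touch(page)
    cold = mem.scan_cold()
    mem.sync = True            # UFFDIO_SET_MODE(sync)
    mem.evict(cold)
    mem.sync = False           # UFFDIO_SET_MODE(async)
    return cold
```

In this model a page that is only read during the interval still counts as
hot, so it is never handed to evict().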

> Have you thought about directly relying on userfaultfd-wp to do this work?
> The relevant question is, why do we need to block guest reads on pages
> being evicted by the userapp?  Can we still allow that to happen, which
> seems to be more efficient?  IIUC, only writes / updates matter in such
> a swap system.

But we do care about read accesses. We don't want to swap out pages
that got read-touched. And in practice we cannot switch to WP mode
after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls,
each with a TLB flush.

With my approach, switching between tracking and reclaiming is a single
bit flip under the mmap lock.
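
As a rough illustration of the call-count argument, the Python sketch below
groups a scattered set of cold pages into contiguous runs, assuming one
ranged write-protect call per run. The function name and the sample page
numbers are made up for illustration; this models call counts only and is
not a measurement.

```python
def contiguous_runs(pages):
    """Group page numbers into (start, length) runs of adjacent pages."""
    runs = []
    for page in sorted(set(pages)):
        if runs and page == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((page, 1))          # start a new run
    return runs


cold = [0, 1, 2, 7, 9, 10, 42]  # hypothetical scan result
runs = contiguous_runs(cold)
print(len(runs), "ranged calls vs. 1 mode flip")
```

The more fragmented the cold set is, the closer the number of runs gets to
the number of cold pages, while the mode flip stays a single operation.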

> Also, I'm not sure if you're aware of LLNL's umap library:
> 
> https://github.com/llnl/umap
> 
> That implemented the swap system using userfaultfd wr-protect mode only, so
> no new kernel API is needed.

Will look into it. Thanks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


