From: Peter Xu <peterx@redhat.com>
To: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: [LSF/MM/BPF TOPIC] Userspace-driven memory tiering
Date: Tue, 10 Feb 2026 15:52:07 -0500
Message-ID: <aYuad2k75iD9bnBE@x1.local>

Hi,

I would like to propose a topic to discuss userspace-driven memory tiering.

Note that neither the subject nor the contents are stabilized; currently,
the only thing that is certain is the use case.  The hope is that the
content below will define the topic well, and especially the use case to be
discussed here, so as to collect feedback.

I'll try to make the topic more condensed and focused if it gets selected
and discussed at the conference.  If anyone thinks the below is too much
for one topic, I am open to splitting it into two or more.

It's also possible I have made silly mistakes below, as I haven't coded
anything up or tested yet and may have overlooked things; please kindly
bear with me if so.

Problem
=======

Here, I'm not yet looking at anything more complex than two tiers.  The
use case can be as simple as: one process has only a portion of its memory
serviced by fast devices (like DRAM), with the rest serviced by slow
devices (e.g. SSDs).  In a VM use case, this allows a host to service more
VMs via overprovisioning.

With the help of memcg and MGLRU, the Linux swap subsystem can do this well
enough, except that it misses one cornerstone of VM hosting: we want to be
flexible enough to move VMs around the cluster.

When that happens, the hypervisor needs to scan the VM pages one by one and
copy them to a peer host in a busy loop.  Here, due to the nature of swap
transparency, userspace needs to fault in all the cold pages just to fetch
the data for the move.  Meanwhile, after migration the hotness information
is also lost because of that same "transparency": the incoming data must
first be applied on top of RAM.
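
To illustrate, a minimal sketch of what that busy loop looks like today
(send_to_peer() is a placeholder, 4KiB pages are assumed, and error
handling is omitted):

  /*
   * Status quo: reading guest memory faults every cold page back into
   * DRAM, because swap transparency offers no other way at the data.
   */
  #include <string.h>
  #include <sys/mman.h>

  extern void send_to_peer(const void *buf, size_t len);  /* placeholder */

  static void migrate_naive(char *base, size_t npages)
  {
      char buf[4096];
      unsigned char vec;

      for (size_t i = 0; i < npages; i++) {
          char *page = base + i * 4096;

          /* mincore() can tell us the page is swapped out... */
          mincore(page, 4096, &vec);
          /* ...but fetching its data means touching it, which
           * triggers a swap-in fault and allocates a folio. */
          memcpy(buf, page, 4096);
          send_to_peer(buf, sizeof(buf));
      }
  }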

In this use case, memcg is almost great at servicing multiple needs.
However, it is still coarse grained in some aspects: it allows limiting
swap usage, but it doesn't yet allow specifying which swap device a process
should use, or how many IOPS it may consume on the swap devices.  This is
less of a concern in the whole picture, but would be nice to have.

Below are some possible solutions I would like to collect input on.  NACKs
are more than welcome; they may also help find the right path acceptable to
everyone.

I'll start with the solution that I think might be the most efficient and
straightforward, and I'll only discuss the solutions from the kernel's
perspective.  The last solution is fully userspace-implemented; I'll only
mention it, as there's nothing to change from the Linux POV.

Possible Solutions
==================

(1) Backend-aware Swap Data Access

To solve the major problem above, we want to know whether there is a way
for userspace to directly access a swap device without causing pages to be
faulted in, without polluting hotness information (in the case of MGLRU,
the generations or tiers), and without consuming DRAM / causing folio
allocations while doing so.

Considering we have the mincore(2) system call, would it be possible to
provide a similar syscall that, besides reporting "whether the page is
resident in RAM", can also access the data on the backend when it's a swap
entry?

(1.a) New syscall swap_access()

  swap_access(addr, len, flags, *vec, *buffer)

  addr:   start virtual address of the range
  len:    length of the virtual address range
  flags:  operation flags (e.g. read / write for swap)
  vec:    an array receiving pgtable info (e.g. is it a swap entry?)
  buffer: an array of data buffers (for either read or write)
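
In C, the uAPI could look like below; all names and values are
hypothetical, nothing here exists in the kernel today:

  /* Hypothetical userspace-visible definitions for the proposed
   * syscall; flag names and values are made up for illustration. */
  #include <stddef.h>
  #include <stdint.h>

  /* flags: operation to perform on swap-backed entries */
  #define SWAP_OP_READ     0x1
  #define SWAP_OP_WRITE    0x2

  /* vec[i]: per-page result reported back to userspace */
  #define SWAP_FL_READ_OK  0x1   /* was a swap entry; data read out  */
  #define SWAP_FL_WRITE_OK 0x2   /* swap entry created; data written */

  long swap_access(void *addr, size_t len, unsigned int flags,
                   uint64_t *vec, void *buffer);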

Examples:

When the userapp finds that mincore() reports a swap entry, to read the
data without faulting it into the mapping, it can bypass the mapping and
issue:

  swap_access(addr, PSIZE, SWAP_OP_READ, vec[], buffer[])

This first checks the pgtable to see whether the entry is a swap entry; if
so, it reads from the swap backend, puts the data into buffer[0], and sets
SWAP_FL_READ_OK in vec[0], meaning it was a swap entry and was read
successfully.  It doesn't touch the pgtable: the entry remains a swap
entry.

OTOH, when the userapp knows some data is cold (but still useful) and wants
to populate it directly into swap without allocating folios at all, it can
use:

  swap_access(addr, PSIZE, SWAP_OP_WRITE, vec[], buffer[])

This first checks that the pgtable entry is empty and nothing is allocated;
if so, it allocates a swap entry, writes the data (in buffer[0]) to the
swap device, then sets vec[0] to SWAP_FL_WRITE_OK, meaning the data was
populated.  After the syscall returns, the pgtable should have one swap
entry populated without any folio having been allocated.
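
Putting the two together, a migration flow on top of this hypothetical API
could look like below (send_to_peer()/recv_from_peer() are placeholders;
PSIZE is the page size as above):

  /* Source host: read the cold page's data without swapping it in;
   * the pgtable entry stays a swap entry, no folio is allocated. */
  uint64_t vec[1];
  char buffer[PSIZE];

  if (swap_access(addr, PSIZE, SWAP_OP_READ, vec, buffer) == 0 &&
      (vec[0] & SWAP_FL_READ_OK))
      send_to_peer(buffer, PSIZE);

  /* Target host: install the data as a swap entry directly, without
   * consuming DRAM; on success vec[0] has SWAP_FL_WRITE_OK set. */
  recv_from_peer(buffer, PSIZE);
  swap_access(addr, PSIZE, SWAP_OP_WRITE, vec, buffer);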

NOTE: due to the transparency, there may be race conditions between
swap_access() and the page being swapped in/out on the fly.  We can either
serialize swap_access() against those, or directly fail the racing
swap_access() with "concurrent access / -EBUSY".  Normally such a race
means the page is being promoted to a hotter tier, so failing back to
userspace should be fine; it tells userspace that this is now a hot page
and can be accessed directly from DRAM.
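
In code, the fallback on such a race could be as simple as (again assuming
the hypothetical API above, returning -1 with errno set):

  /* EBUSY means the page is racing with swap-in/out: treat it as hot
   * and read it through the mapping instead.  That does fault it in,
   * but the page was on its way to DRAM anyway. */
  if (swap_access(addr, PSIZE, SWAP_OP_READ, vec, buffer) < 0 &&
      errno == EBUSY)
      memcpy(buffer, addr, PSIZE);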

Supporting both anonymous memory and shmem should suffice here.  We could
start with just one of them, say anonymous memory, if this turns out to be
useful at all.

(1.b) Genuine O_DIRECT support for shmem

Shmem has supported O_DIRECT since Hugh's commit e88e0d366f9cf ("tmpfs:
trivial support for direct IO") in 2023 (v6.6+).  At the time it was only
for easier testing, and I believe there's no real use case yet.  Maybe we
can redefine this API so that O_DIRECT means "read/write the swap backend"?

It means reads/writes would need to be 4K-aligned with shmem O_DIRECT, with
all operations happening directly on the swap devices without populating
the page cache with real folios.  We'd likely need to properly serialize
concurrent accesses, but that's fine: if we're doing O_DIRECT, the shmem
page is already cold, so slower is OK.
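
For reference, the plumbing below already works on v6.6+; what this
proposal would change is only the semantics, so that the read is served
straight from the swap device rather than via page cache folios (the tmpfs
path is an arbitrary example):

  #define _GNU_SOURCE             /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  static void shmem_direct_read(void)
  {
      /* O_DIRECT open of a tmpfs file has been accepted since
       * e88e0d366f9cf; buffers and offsets must be 4K-aligned. */
      int fd = open("/dev/shm/guest-mem", O_RDWR | O_DIRECT);
      void *buf;

      posix_memalign(&buf, 4096, 4096);
      pread(fd, buf, 4096, 0);    /* offset 0: 4K-aligned */
  }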

It also means we can't easily support anonymous memory with this method,
unfortunately.

(2) Hotness Information API

Taking one step back: if the solution above doesn't work out for whatever
reason, it may mean the userapp needs to implement swap on its own in order
to access the backends directly.  Even then, there's still a chance to
share page hotness information with the kernel's vmscan logic.  Since MGLRU
seems to be the better candidate nowadays, I'll focus on it.

MGLRU by default doesn't work well with idle page tracking, likely because
nobody is expected to use idle page tracking when MGLRU is present.
However, if a userapp needs to manage swap for one single process for
whatever reason, we may want to let most of the host run with MGLRU while
that specific process manages swap on its own.  That single process may
still need page hotness info.

Then the question is: can this process still share page hotness
information with the kernel, so that we don't need idle page tracking
(which mostly stops working well with MGLRU enabled)?

That means allowing per-page / per-folio reporting of MGLRU hotness
information, covering both generations and tiers.  One way to do it is via
pagemap or a similar interface.  Would this be acceptable?
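
Purely as an illustration of the shape (the bit positions are made up; no
such reporting exists today), a pagemap-style read could be:

  #include <fcntl.h>
  #include <stdint.h>
  #include <unistd.h>

  static void read_hotness(unsigned long addr)
  {
      int fd = open("/proc/self/pagemap", O_RDONLY);
      uint64_t ent;

      pread(fd, &ent, sizeof(ent), (addr / 4096) * sizeof(ent));
      /* Hypothetical layout: some currently-unused bits carry the
       * MGLRU generation and tier of the backing folio. */
      unsigned gen  = (ent >> 57) & 0x7;
      unsigned tier = (ent >> 60) & 0x3;
      (void)gen; (void)tier;
  }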

Going one step further: consider moving a cold page from one host to
another, where we want to apply the page's hotness alongside its data.
Would we allow the reverse operation, i.e. userspace providing hotness
hints to the kernel?  E.g., consider ioctl(UFFDIO_COPY) with gen+tier
information attached, so that when the new folio is atomically created and
populated, it is put into the proper gen+tier.
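
In uAPI terms, that could mean extending the copy structure; below, the
first five fields are today's struct uffdio_copy, and everything after the
marker is the hypothetical extension:

  #include <linux/types.h>

  struct uffdio_copy_tiered {
      __u64 dst;
      __u64 src;
      __u64 len;
      __u64 mode;   /* could grow a hypothetical MGLRU-hint mode flag */
      __s64 copy;
      /* --- hypothetical extension, not in any uAPI today --- */
      __u32 gen;    /* target MGLRU generation for the new folio */
      __u32 tier;   /* target MGLRU tier */
  };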

(3) Fully Userspace Swap Implementation

This will almost always work with no kernel change needed, with the help of
userfaultfd.  I'll skip the details of what happens inside the userapp.
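
For completeness, the rough shape is below; this uses only the existing
userfaultfd uAPI, with fetch_cold_page() as a placeholder for wherever the
userapp keeps its cold data (error handling omitted):

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  extern void *fetch_cold_page(unsigned long addr);  /* placeholder */

  static void pager_loop(char *base, size_t len)
  {
      int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
      struct uffdio_api api = { .api = UFFD_API };
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)base, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };

      ioctl(uffd, UFFDIO_API, &api);
      ioctl(uffd, UFFDIO_REGISTER, &reg);

      for (;;) {
          struct uffd_msg msg;

          read(uffd, &msg, sizeof(msg));   /* blocks until a fault */
          if (msg.event != UFFD_EVENT_PAGEFAULT)
              continue;

          struct uffdio_copy copy = {
              .dst = msg.arg.pagefault.address & ~0xfffUL,
              .src = (unsigned long)fetch_cold_page(msg.arg.pagefault.address),
              .len = 4096,
          };
          ioctl(uffd, UFFDIO_COPY, &copy); /* resolves the fault atomically */
      }
  }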

One exception: we may still need idle page tracking in this case, if (2)
above is not accepted upstream.  So we may either want to conditionally
enable idle page tracking with a CONFIG_ option (so a distro can opt in to
enabling idle page tracking together with MGLRU), or somehow allow MGLRU to
work properly while this specific process uses idle page tracking.

-- 
Peter Xu


