From: Dan Williams <dan.j.williams@intel.com>
To: Srinivas Aji <srinivas.aji@memverge.com>,
David Hildenbrand <david@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
Dan Williams <dan.j.williams@intel.com>,
Vivek Goyal <vgoyal@redhat.com>,
David Woodhouse <dwmw@amazon.com>,
"Gowans, James" <jgowans@amazon.com>,
Yue Li <yue.li@memverge.com>,
Beau Beauchamp <beau.beauchamp@memverge.com>
Subject: Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
Date: Mon, 8 Aug 2022 16:05:44 -0700 [thread overview]
Message-ID: <62f196c86bec5_1b3c2945d@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <YvF+TWbB3BMxwxnK@memverge.com>
Srinivas Aji wrote:
> On Fri, Aug 05, 2022 at 02:46:26PM +0200, David Hildenbrand wrote:
> > Can you explain how "zero copy snapshots of processes" would work, both
> >
> > a) From a user space POV
> > b) From a kernel-internal POV
> >
> > Especially, what I get is that you have a filesystem on that memory
> > region, and all memory that is not used for filesystem blocks can be
> > used as ordinary system RAM (a little like shmem, but restricted to dax
> > memory regions?).
> >
> > But how does this interact with zero-copy snapshots?
> >
> > I feel like I am missing one piece where we really need system RAM as
> > part of the bigger picture. Hopefully it's not some hack that converts
> > system RAM to file system blocks :)
>
> My proposal probably falls into this category. The idea is that if we
> have the persistent filesystem in the same space as system RAM, we
> could make most of the process pages part of a snapshot file by
> holding references to these pages and making the pages
> copy-on-write for the process, in about the same way a forked child
> would. (I still don't have this piece fully worked out. Maybe there
> are reasons why this won't work or will make something else difficult,
> and that is why you are advising against it.)
If I understand the proposal correctly I think you eventually run into
situations similar to what killed RDMA+FSDAX support. The filesystem
needs to be the ultimate arbiter of the physical address space and this
solution seems to want to put part of that control in an agent outside
of the filesystem.
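For concreteness, the fork()-style CoW sharing the proposal leans on can
be demonstrated from userspace today (a minimal sketch; the snapshot
design would do the equivalent reference-taking in the filesystem rather
than via fork()):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * After fork() the child holds references to the parent's anonymous
 * pages; a later write by the parent is redirected to a private copy,
 * so the child's (the "snapshot's") view is unchanged.  Returns 0 if
 * the child still observed the original contents.
 */
int cow_demo(void)
{
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status;
	pid_t pid;

	if (page == MAP_FAILED)
		return -1;
	memset(page, 'A', 4096);
	pid = fork();
	if (pid == 0)
		_exit(page[0] == 'A' ? 0 : 1);	/* child: snapshot view */
	page[0] = 'B';	/* parent: write breaks CoW, child unaffected */
	waitpid(pid, &status, 0);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```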
> Regarding the userspace and kernel POV:
>
> The userspace operation would be that the process tries to save or
> restore its pages using vmsplice(). In the kernel, this would be
> implemented using a filesystem which shares pages with system RAM and
> uses a zero-copy COW mechanism for those process pages which can be
> shared with the filesystem.
>
> I had earlier been thinking of having a different interface to the
> kernel, which creates a file with only those memory pages which can be
> saved using COW and also indicates to the caller which pages have
> actually been saved. But having a vmsplice implementation which does
> COW as far as possible keeps the userspace process indicating the
> desired function (saving or restoring memory pages) and the kernel
> implementation handling the zero copy as an optimization where
> possible.
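As a hedged sketch of that userspace flow (error handling trimmed, and
the COW-aware filesystem on the receiving side is the hypothetical
part), the save path with vmsplice() might look like:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gift a region of process memory to the kernel with vmsplice() and
 * move it into 'out' (one end of a splice pair must be a pipe).
 * SPLICE_F_GIFT tells the kernel it may keep the pages rather than
 * copy them; a COW-aware filesystem on the receiving side could then
 * take a reference and leave the pages mapped read-only in the
 * process, which is the zero-copy behavior described above.  The gift
 * only takes effect for page-aligned, page-sized chunks; otherwise
 * the kernel falls back to copying.
 */
int snapshot_region(int out, void *base, size_t len)
{
	struct iovec iov = { .iov_base = base, .iov_len = len };
	int pipefd[2];

	if (pipe(pipefd))
		return -1;
	while (iov.iov_len) {
		ssize_t n = vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT);

		if (n <= 0 ||
		    splice(pipefd[0], NULL, out, NULL, n, SPLICE_F_MOVE) != n)
			return -1;
		iov.iov_base = (char *)iov.iov_base + n;
		iov.iov_len -= n;
	}
	close(pipefd[0]);
	close(pipefd[1]);
	return 0;
}
```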
While my initial reaction to hearing about this proposal back at LSF
indeed made it sound like an extension to FSDAX semantics, now I am not
so sure. This requirement you state, "...we have to get the blocks
through the memory allocation API, at an offset not under our control"
makes me feel like this is a new memory management facility where the
application thinks it is getting page allocations serviced via the
typical malloc+mempolicy APIs, but another agent is positioned to trap
and service those requests.
Correct me if I am wrong, but is the end goal similar to what an
application in a VM experiences when that VM's memory is backed by a
file mapping on the VMM side? I.e. the application is accessing a
virtual NUMA node, but the faults into physical address space are
trapped and serviced by the VMM. If that is the case then the solution
starts to look more like NUMA "namespacing" than a block-device + file
interface. In other words a rough (I mean rough) strawman like:
numactlX --remap=3,/dev/dax0.0 --membind=3 $application
Where memory allocation and refault requests can be trapped by that
modified numactl. As far as the application is concerned its memory
policy is set to allocate from NUMA node 3, and those page allocation
requests are routed to numactlX via userfaultfd-like mechanics to map
pages out of /dev/dax0.0 (or any other file for that matter).
Snapshotting would be achieved by telling numactlX to CoW all of the
pages that it currently has mapped while the snapshot is taken.
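The trap-and-service mechanics could be prototyped today with
userfaultfd. A rough sketch, where the handler thread stands in for the
hypothetical numactlX agent and a static buffer stands in for pages that
would really come from /dev/dax0.0:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1	/* older headers; needs kernel >= 5.11 */
#endif

static long uffd, page_size;

/* Stand-in for the hypothetical numactlX agent: service one missing-
 * page fault by copying in content that would, in the real design,
 * come from /dev/dax0.0 or any other file. */
static void *fault_handler(void *src)
{
	struct uffd_msg msg;
	struct uffdio_copy copy;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
	    msg.event != UFFD_EVENT_PAGEFAULT)
		return NULL;
	copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
	copy.src = (unsigned long)src;
	copy.len = page_size;
	copy.mode = 0;
	ioctl(uffd, UFFDIO_COPY, &copy);
	return NULL;
}

/* Register an anonymous region so first-touch faults are routed to the
 * handler thread, then touch it.  Returns the byte the handler
 * supplied, or -1 if userfaultfd is unavailable. */
int trap_demo(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	static char backing[65536];	/* stand-in for the backing file */
	pthread_t thr;
	char *region, byte;

	page_size = sysconf(_SC_PAGESIZE);
	uffd = syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return -1;
	region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return -1;
	reg.range.start = (unsigned long)region;
	reg.range.len = page_size;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	memset(backing, 'Z', sizeof(backing));
	pthread_create(&thr, NULL, fault_handler, backing);
	byte = region[0];	/* faults; blocks until the handler resolves */
	pthread_join(thr, NULL);
	return byte;
}
```

The restart/refault path after a snapshot would be the same shape: the
agent CoWs or write-protects its mapped pages, and subsequent faults are
again routed through the userfaultfd-like channel.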
Thread overview: 11+ messages
2022-08-02 17:57 Srinivas Aji
2022-08-02 18:02 ` [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated Srinivas Aji
2022-08-02 18:03 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
2022-08-02 18:53 ` Srinivas Aji
2022-08-02 18:07 ` [RFC PATCH 2/4] device-dax: Add framework for keeping persistent data in DAX KMEM Srinivas Aji
2022-08-02 18:10 ` [RFC PATCH 3/4] device-dax: Add a NONE type for DAX KMEM persistence Srinivas Aji
2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
2022-08-03 21:19 ` Fabio M. De Francesco
2022-08-05 12:46 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
2022-08-08 21:21 ` Srinivas Aji
2022-08-08 23:05 ` Dan Williams [this message]