linux-mm.kvack.org archive mirror
From: Pedro Falcato <pfalcato@suse.de>
To: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	andreyknvl@gmail.com,  arnd@arndb.de, bp@alien8.de,
	brauner@kernel.org, bsegall@google.com,  corbet@lwn.net,
	dave.hansen@linux.intel.com, david@redhat.com,
	 dietmar.eggemann@arm.com, ebiederm@xmission.com, hpa@zytor.com,
	jakub.wartak@mailbox.org,  jannh@google.com,
	juri.lelli@redhat.com, khalid@kernel.org,
	 liam.howlett@oracle.com, linyongting@bytedance.com,
	lorenzo.stoakes@oracle.com,  luto@kernel.org,
	markhemm@googlemail.com, maz@kernel.org, mhiramat@kernel.org,
	 mgorman@suse.de, mhocko@suse.com, mingo@redhat.com,
	muchun.song@linux.dev,  neilb@suse.de, osalvador@suse.de,
	pcc@google.com, peterz@infradead.org,  rostedt@goodmis.org,
	rppt@kernel.org, shakeel.butt@linux.dev, surenb@google.com,
	 tglx@linutronix.de, vasily.averin@linux.dev, vbabka@suse.cz,
	 vincent.guittot@linaro.org, viro@zeniv.linux.org.uk,
	vschneid@redhat.com,  willy@infradead.org, x86@kernel.org,
	xhao@linux.alibaba.com,  linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Subject: Re: [PATCH v3 01/22] mm: Add msharefs filesystem
Date: Wed, 10 Sep 2025 13:14:13 +0100
Message-ID: <do7cmy4eiiqd5ux62r3u2ghizc62ljg5m3mqx7qzy3im4kc2p6@upmigdbp7eat>
In-Reply-To: <20250820010415.699353-2-anthony.yznaga@oracle.com>

On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
> From: Khalid Aziz <khalid@kernel.org>
> 
> Add a pseudo filesystem that contains files and page table sharing
> information that enables processes to share page table entries.
> This patch adds the basic filesystem that can be mounted, a
> CONFIG_MSHARE option to enable the feature, and documentation.
> 
> Signed-off-by: Khalid Aziz <khalid@kernel.org>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  Documentation/filesystems/index.rst    |  1 +
>  Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
>  include/uapi/linux/magic.h             |  1 +
>  mm/Kconfig                             | 11 +++
>  mm/Makefile                            |  4 ++
>  mm/mshare.c                            | 97 ++++++++++++++++++++++++++
>  6 files changed, 210 insertions(+)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
> 
> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
> index 11a599387266..dcd6605eb228 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -102,6 +102,7 @@ Documentation for filesystem implementations.
>     fuse-passthrough
>     inotify
>     isofs
> +   msharefs
>     nilfs2
>     nfs/index
>     ntfs3
> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
> new file mode 100644
> index 000000000000..3e5b7d531821
> --- /dev/null
> +++ b/Documentation/filesystems/msharefs.rst
> @@ -0,0 +1,96 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +Msharefs - A filesystem to support shared page tables
> +=====================================================
> +
> +What is msharefs?
> +-----------------
> +
> +msharefs is a pseudo filesystem that allows multiple processes to
> +share page table entries for shared pages. To enable support for
> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
> +
> +msharefs is typically mounted like this::
> +
> +	mount -t msharefs none /sys/fs/mshare
> +
> +A file created on msharefs creates a new shared region where all
> +processes mapping that region will map it using shared page table
> +entries. Once the size of the region has been established via
> +ftruncate() or fallocate(), the region can be mapped into processes
> +and ioctls used to map and unmap objects within it. Note that an
> +msharefs file is a control file and accessing mapped objects within
> +a shared region through read or write of the file is not permitted.
> +

Welp. I really really don't like this API.
I assume this has been discussed previously, but why do we need a new
magical pseudofs mounted under some random /sys directory?

But, ok, assuming we're thinking about something hugetlbfs-like, that's not too
bad, and programs already know how to use it.

> +How to use mshare
> +-----------------
> +
> +Here are the basic steps for using mshare:
> +
> +  1. Mount msharefs on /sys/fs/mshare::
> +
> +	mount -t msharefs msharefs /sys/fs/mshare
> +
> +  2. mshare regions have alignment and size requirements. Start
> +     address for the region must be aligned to an address boundary and
> +     be a multiple of fixed size. This alignment and size requirement
> +     can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
> +     which returns a number in text format. mshare regions must be
> +     aligned to this boundary and be a multiple of this size.
> +

I don't see why size and alignment need to be taken into consideration by
userspace. You can simply establish a mapping and pad it out.
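
To make the padding point concrete, here is a minimal sketch of the rounding
involved, assuming the required alignment is a power of two. The name
pad_to_alignment() is made up for illustration and is not part of the patch;
the same arithmetic could presumably be done on the kernel side when the region
is sized, so userspace would not have to read mshare_info at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from the patch: round len up to the next
 * multiple of align. align is assumed to be a non-zero power of two.
 * Overflow of len + align - 1 is not handled in this sketch.
 */
static size_t pad_to_alignment(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}
```

With a 4096-byte alignment, a 1-byte request is padded to 4096 and a 4097-byte
request to 8192; an already aligned length is returned unchanged.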

> +  3. For the process creating an mshare region:
> +
> +    a. Create a file on /sys/fs/mshare, for example::
> +
> +        fd = open("/sys/fs/mshare/shareme",
> +                        O_RDWR|O_CREAT|O_EXCL, 0600);

Ok, makes sense.

> +
> +    b. Establish the size of the region::
> +
> +        fallocate(fd, 0, 0, BUF_SIZE);
> +
> +      or::
> +
> +        ftruncate(fd, BUF_SIZE);
> +

Yep.

> +    c. Map some memory in the region::
> +
> +	struct mshare_create mcreate;
> +
> +	mcreate.region_offset = 0;
> +	mcreate.size = BUF_SIZE;
> +	mcreate.offset = 0;
> +	mcreate.prot = PROT_READ | PROT_WRITE;
> +	mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
> +	mcreate.fd = -1;
> +
> +	ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);

Why?? Do you want to map mappings in msharefs files that can themselves be
mapped? Why do we need an ioctl here?

Really, this feature seems very overengineered. If you want to go the fs route,
doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
like a decent idea. Or enhancing tmpfs to actually support this kind of stuff.
Or properly doing a syscall that can try to attach the page-table-sharing
property to random VMAs.
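
For reference, this is roughly the userspace flow I have in mind for the
"just like hugetlb" option: open, size, mmap, and nothing else. It is only a
sketch under assumptions: map_shared_file() is a made-up name, and the path is
whatever the caller passes (an ordinary file works for trying it out). No such
page-table-sharing pseudofs exists today.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open (creating if needed) the file at path, set
 * its size to len, and map it MAP_SHARED. Returns the mapping, or
 * MAP_FAILED on any error. There is no ioctl step.
 */
static void *map_shared_file(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return MAP_FAILED;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps the file referenced */
	return p;
}
```

Any process that calls this with the same path sees the same memory, which is
the model hugetlbfs and tmpfs users are already familiar with.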

But I'm wholly opposed to the idea of "mapping a file that itself has more
mappings, mappings which you establish using a magic filesystem and ioctls".

-- 
Pedro

