linux-mm.kvack.org archive mirror
From: Kalesh Singh <kaleshsingh@google.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	Anthony Yznaga <anthony.yznaga@oracle.com>,
	linux-mm@kvack.org,  akpm@linux-foundation.org,
	andreyknvl@gmail.com, arnd@arndb.de, bp@alien8.de,
	 brauner@kernel.org, bsegall@google.com, corbet@lwn.net,
	 dave.hansen@linux.intel.com, dietmar.eggemann@arm.com,
	ebiederm@xmission.com,  hpa@zytor.com, jakub.wartak@mailbox.org,
	jannh@google.com,  juri.lelli@redhat.com, khalid@kernel.org,
	liam.howlett@oracle.com,  linyongting@bytedance.com,
	lorenzo.stoakes@oracle.com, luto@kernel.org,
	 markhemm@googlemail.com, maz@kernel.org, mhiramat@kernel.org,
	mgorman@suse.de,  mhocko@suse.com, mingo@redhat.com,
	muchun.song@linux.dev, neilb@suse.de,  osalvador@suse.de,
	pcc@google.com, peterz@infradead.org, rostedt@goodmis.org,
	 rppt@kernel.org, shakeel.butt@linux.dev, surenb@google.com,
	 tglx@linutronix.de, vasily.averin@linux.dev, vbabka@suse.cz,
	 vincent.guittot@linaro.org, viro@zeniv.linux.org.uk,
	vschneid@redhat.com,  willy@infradead.org, x86@kernel.org,
	xhao@linux.alibaba.com,  linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,  linux-arch@vger.kernel.org,
	Isaac Manjarres <isaacmanjarres@google.com>,
	 "T.J. Mercier" <tjmercier@google.com>,
	android-mm <android-mm@google.com>
Subject: Re: [PATCH v3 00/22] Add support for shared PTEs across processes
Date: Thu, 26 Feb 2026 22:34:56 -0800
Message-ID: <CAC_TJvdC+CSqvx+BvOv4gO2mJbwiBhb6OZO0sx=GXQ0CmA853g@mail.gmail.com>
In-Reply-To: <5tdailzxoywzzunbwhtlk4yjfmzunntniqtudkb52q6hib74ql@oq4mi226dedv>

On Thu, Feb 26, 2026 at 1:22 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Wed, Feb 25, 2026 at 03:06:10PM -0800, Kalesh Singh wrote:
> > On Tue, Feb 24, 2026 at 1:40 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> > >
> > > > I believe that managing a pseudo-filesystem (msharefs) and mapping via
> > > > ioctl during process creation could introduce overhead that impacts
> > > > app startup latency. Ideally, child apps shouldn't be aware of this
> > > > sharing or need to manage the pseudo-filesystem on their end.
> > > > All processes must be aware of these special semantics.
> > >
> > > I'd assume that fork() would simply replicate mshare region into the
> > > fork'ed child process. So from that point of view, it's "transparent" as
> > > in "no special mshare() handling required after fork".
> >
> > Hi David,
> >
> > That's a good point. If fork() simply replicates the mshare region, it
> > does achieve transparency in terms of setup.
> >
> > I am still concerned about transparency in terms of observability.
> > Applications sometimes inspect their own mappings (from
> > /proc/self/maps) to locate specific code or data regions for various
> > anti-tamper and obfuscation techniques. [2] If those mappings suddenly
> > point to an msharefs pseudo-file instead of the expected shared
> > library backing, it may break user-space assumptions and cause
> > compatibility issues.
>
> I'm not worried about transparency because this is not supposed to be
> transparent. This is not supposed to be used by most core system software.
> This is supposed to help replace hugetlb page table sharing.
>

Hi Pedro,

Thanks for the detailed breakdown.

First, let me state that my goal definitely isn't to derail or block
the current mshare efforts. I'm mostly just trying to gather feedback
on what a "transparent" approach might actually look like.

> Transparent page table sharing has other constraints. I like the idea, in
> theory, but there are a number of constraints that make the idea unfeasible
> for now. There are a couple of problems we need to solve first:
>
> 1) Every spot where we modify PTEs needs to be assessed and use different
> helpers (that can un-cow page tables). Every pte_offset_map_lock() can now
> feasibly fail for OOM reasons (and that also needs to be assessed).
>

What if we strictly limit the scope to just read-only mappings being
shared? Would un-COWing still be necessary?

> 2) Various bits of PTE modification/unmapping now need special care wrt TLB
> invalidation. The kernel needs to be aware of how the page tables are shared.
> I don't think the current rmap data structures are well suited to this kind
> of stuff (perhaps with Lorenzo's WIP anon rmap rework we'll get something
> better). Basically every spot that goes "modify PTE, flush TLB for mm" now
> needs to go "modify PTE, for every mm that maps this page table, flush $mm"
> (if you're thinking that COW will save us, it technically won't, or shouldn't,
> because of stuff like try_to_unmap_one() that is used in reclaim).

I think this bit might need to be architecture dependent. With shared
TLB partitioning on certain hardware, this becomes much less of an
issue. We could potentially gate this behind something like
CONFIG_ARCH_HAVE_SHARED_TLB_SUPPORT (or a similarly fitting name) so
only architectures that can handle the invalidation efficiently opt
in.
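
To make the fan-out in point 2 concrete, here is a toy userspace model
(not kernel code). It only illustrates the "modify PTE, for every mm
that maps this page table, flush $mm" pattern quoted above. Every name
in it (MM, SharedPageTable, set_pte, flush_tlb) is hypothetical and
exists only for this sketch.

```python
# Toy model of TLB-flush fan-out for a shared page table.
# None of these names correspond to real kernel symbols.
from dataclasses import dataclass, field


@dataclass
class MM:
    """Stand-in for an address space; counts the flushes it receives."""
    name: str
    tlb_flushes: int = 0

    def flush_tlb(self):
        self.tlb_flushes += 1


@dataclass
class SharedPageTable:
    """A page table mapped by several address spaces at once."""
    sharers: list = field(default_factory=list)
    ptes: dict = field(default_factory=dict)

    def set_pte(self, addr, value):
        # Modify the PTE once...
        self.ptes[addr] = value
        # ...then flush every mm that maps this page table.
        for mm in self.sharers:
            mm.flush_tlb()
        return len(self.sharers)
```

In this model one PTE update costs one flush per sharer, so the cost
grows with the number of processes sharing the table. That is the cost
an architecture-specific opt-in would be trying to avoid.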

>
> 3) Reclaim loses even more information as now N processes share the same A
> bits. I don't know what effects this can cause. It would require
> experimentation. Perhaps something like "if page table is shared, value
> pte_young more". I don't know if this can work as a bandaid, but it's not
> ideal.

I agree this will require some experimentation. Intuitively, I like to
think these shared pages might naturally stay "hotter" since multiple
processes are accessing them concurrently, but we will definitely need
to experiment with the reclaim logic to see how it does in practice.

>
> 4) It's not known whether page table COW fork() is a real win in most cases,
> or all cases. Would want measurement.

Our preliminary data on Android shows this can save ~200MB or more on
mobile devices right after boot. On memory-constrained client devices,
that is a significant win.

>
> 5) It becomes even harder to estimate RSS and PSS for each process.

For PSS (PAGE_SIZE / mapcount), I can see that a single mapcount from
all the processes mapping the page through the shared page table would
skew the result. That said, PSS is already imperfect; I think
processes can artificially lower their PSS by mapping the same file
multiple times.
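
A small worked example of the skew, using the PAGE_SIZE / mapcount
formula above. It assumes a 4 KiB page and, as supposed above, that
all sharers of a page table contribute a single mapcount between them.
The helper names are made up for illustration.

```python
# Worked example of PSS = PAGE_SIZE / mapcount for one resident page.
# Assumption: 4 KiB pages; helper names are illustrative only.
PAGE_SIZE = 4096


def pss_per_process(mapcount):
    """Share of one page attributed to each process mapping it."""
    return PAGE_SIZE / mapcount


def total_pss(nr_processes, mapcount):
    """Sum of the per-process shares across all mapping processes."""
    return nr_processes * pss_per_process(mapcount)


# Four processes, each with its own page table: mapcount == 4.
private_total = total_pss(4, 4)

# Four processes behind one shared page table: mapcount == 1 (assumed).
shared_total = total_pss(4, 1)
```

With private page tables the four shares sum back to one page (4096
bytes). With the assumed single mapcount each process is charged the
whole page, so the sum is 16384 bytes, four times the memory actually
used.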

For RSS, I'm not sure I see the blockers to aggregating across the
private and shared mm_structs?

>
> For these reasons (and more, certainly), I don't think working mshare() into
> a transparent, all-great thing that fits the zygote model can work. It has been
> discussed at length how to pull off certain hard bits like TLB invalidation and
> locking for mshare, and with mshare we have the advantage of not needing to
> support every feature ever (tailoring it more to the big database users of
> hugetlb). And we'll still need to adapt certain bits of arch code just to get
> it to work efficiently.
>
> This said, if you want to discuss pulling this off, I'm all ears and it could
> be perhaps a fun discussion (too late for LSF, I guess), but I don't think
> it's workable into the current mshare efforts. And, believe me, I would love
> a unified feature here :)

I saw Anthony proposed an mshare topic for LSF/MM; I hope to be there
as well, it would be great to chat about this in person.

Thanks,
Kalesh

>
> --
> Pedro

