From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>,
	lsf-pc@lists.linux-foundation.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
Date: Thu, 29 Feb 2024 16:15:18 +0100
Message-ID: <0a63c084-7507-42fd-9201-ab2ca8d38f6e@redhat.com>
In-Reply-To: <ZeCQs34DQgu_kxQu@casper.infradead.org>

On 29.02.24 15:12, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
>> On 28.02.24 23:56, Khalid Aziz wrote:
>>> Threads of a process share an address space and page tables, which
>>> allows for two key advantages:
>>>
>>> 1. The amount of memory required for PTEs to map physical pages stays
>>> low even when a large number of threads share the same pages, since
>>> PTEs are shared across threads.
>>>
>>> 2. Page protection attributes are shared across threads, and a change
>>> of attributes applies immediately to every thread, without any
>>> overhead of coordinating protection-bit changes across threads.
>>>
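
A minimal illustration of point 2, using only standard Linux APIs (the
4KiB buffer and the imagined worker threads are made up for
illustration): a single mprotect() call updates the PTEs that all
threads share, so the change is visible to every thread at once:

	#include <sys/mman.h>

	int main(void)
	{
		/* Threads of this process share one set of page tables. */
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* ... imagine worker threads reading and writing buf ... */

		/* One call flips the shared PTEs to read-only; from now
		 * on a write from *any* thread faults. No coordination
		 * of protection bits across threads is needed. */
		mprotect(buf, 4096, PROT_READ);
		return 0;
	}
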
>>> These advantages no longer apply when unrelated processes share pages.
>>> Large database applications can easily comprise 1000s of processes
>>> that share 100s of GB of pages. In cases like this, the amount of
>>> memory consumed by page tables can exceed the size of the actual
>>> shared data. On a database server with a 300GB SGA, a system crash
>>> due to an out-of-memory condition was seen when 1500+ clients tried
>>> to share this SGA, even though the system had 512GB of memory. On
>>> this server, the worst-case scenario of all 1500 processes mapping
>>> every page of the SGA would have required 878GB+ for just the PTEs.
>>>
>>> I have sent proposals and patches to solve this problem by adding a
>>> mechanism to the kernel that lets processes opt into sharing page
>>> tables with other processes. We have had discussions on the original
>>> proposal and subsequent refinements, but we have not converged on a
>>> solution. As systems with multi-TB memory and in-memory databases
>>> become more and more common, this is turning into a significant
>>> issue. An interactive discussion can help us reach a consensus on
>>> how to solve this.
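
A quick back-of-envelope check of the 878GB+ figure quoted above,
assuming 4KiB base pages and 8-byte PTEs (the x86-64 defaults), reading
GB as GiB, and ignoring the comparatively tiny upper levels of the page
table tree:

	300 GiB / 4 KiB per page        = 78,643,200 PTEs per process
	78,643,200 PTEs * 8 B per PTE   = 600 MiB of PTEs per process
	600 MiB * 1500 processes        = ~879 GiB

So the quoted worst case checks out, and almost all of it would go away
if the 1500 processes could share a single copy of those page tables.
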
>>
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months
>> ago [1], so part of the "not converging" problem might simply be "no
>> follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed
>> at LSF/MM, but take a step back and discuss requirements and possible
>> adjustments to the original concept, to arrive at something cleaner.
> 
> I think the concept is clean.  Your concept doesn't fit our use case!

Which one exactly are you talking about?

I raised various alternatives/modifications for discussion, learning 
what works and what doesn't along the way. (I never understood why 
protection at the pagecache level wouldn't work for your use case, but 
let's put that aside.)

In my last mail, I had the following:

"
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
      smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) pure page table sharing might itself be a reasonable
      optimization worth having.

I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe giving the kernel a
hint that sharing might be worthwhile).

As said, I see value in some fd-thingy that can be mmap'ed, but is
internally assembled from other fds (using ioctls, not mmap) with
sub-protection (using protect ioctls, not mprotect). The ioctls would
be minimal and clearly specified. Most madvise()/uffd/... would simply
fail when seeing a VMA that mmaps such an fd-thingy. No rerouting of
mmap, munmap, mprotect, ...

Under the hood, one can use an MM to manage all that and share page 
tables. But that would be an implementation detail.
"

So I do think the original mshare could be done "less scary" [1] by 
exposing a different, well-defined and restricted interface to manage 
the "content" of mshare.

There is a lot of detail I have in mind, but it doesn't make sense to 
describe it if it won't solve your use case.

In my world it would end up cleaner, and naive me would have thought 
that you would enjoy something close to the original mshare, just a bit 
less scary :)


> So essentially what you're asking for is for us to do a lot of work
> which doesn't solve our problem.  You can imagine our lack of enthusiasm
> for this.

I recall that implementing generic page table sharing is a lot of work 
and that Oracle isn't interested in doing it; fair enough, I understood 
that.

Really, the amount of work is unclear if we don't talk about the actual 
solution.

I cannot really do more than offer help like I did:

   "I'm happy to discuss further. In a bi-weekly MM meeting, off-list or
    here.".

But if my comments are so unreasonable that they are not even worth 
discussing, I likely wouldn't be of any help in another mshare session.

[1] https://lwn.net/Articles/895217/

-- 
Cheers,

David / dhildenb


