From: Khalid Aziz <khalid.aziz@oracle.com>
To: David Hildenbrand <david@redhat.com>, lsf-pc@lists.linux-foundation.org
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
Date: Mon, 4 Mar 2024 09:45:02 -0700
Message-ID: <83515478-f064-4d28-9125-748860e61a48@oracle.com>
In-Reply-To: <fec23c89-8263-4526-bb0e-66abe2b2bd5c@redhat.com>
On 2/29/24 02:21, David Hildenbrand wrote:
> On 28.02.24 23:56, Khalid Aziz wrote:
>> Threads of a process share an address space and page tables, which
>> allows for two key advantages:
>>
>> 1. The amount of memory required for PTEs to map physical pages stays
>> low even when a large number of threads share the same pages, since
>> the PTEs are shared across threads.
>>
>> 2. Page protection attributes are shared across threads, and a change
>> of attributes applies immediately to every thread without the overhead
>> of coordinating protection bit changes across threads.
>>
>> These advantages no longer apply when unrelated processes share pages.
>> Large database applications can easily comprise thousands of processes
>> that share hundreds of GB of pages. In cases like this, the amount of
>> memory consumed by page tables can exceed the size of the actual shared
>> data. On a database server with a 300GB SGA, a system crash from an
>> out-of-memory condition was seen when 1500+ clients tried to share this
>> SGA, even though the system had 512GB of memory. On this server, the
>> worst-case scenario of all 1500 processes mapping every page of the SGA
>> would have required 878GB+ for the PTEs alone.
>>
>> I have sent proposals and patches to solve this problem by adding a
>> mechanism to the kernel that processes can use to opt into sharing
>> page tables with other processes. We have had discussions on the
>> original proposal and subsequent refinements, but we have not converged
>> on a solution. As systems with multi-TB memory and in-memory databases
>> become more and more common, this is becoming a significant issue. An
>> interactive discussion can help us reach a consensus on how to solve
>> it.
>
> Hi,
>
> I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be
> "no follow-up discussion".
>
> Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss
> requirements and possible adjustments to the original concept to get something possibly cleaner.
>
> For example, I raised some ideas for not having to re-route mprotect()/mmap() calls. At least discussing somewhere why
> they are all bad would be helpful ;)
>
> [1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
>
Hi David,

That is fair. A face-to-face discussion could resolve these questions more easily, but I will attempt to address them
here, and maybe we can come to a better understanding of the requirements. I do want to focus on the requirements and
let them drive the implementation.
On 11/2/23 14:25, David Hildenbrand wrote:
> On 01.11.23 23:40, Khalid Aziz wrote:
>> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
>> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
>
> ... and everyone has to get the fault and mprotect() again,
>
> Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
>
> You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
>
My understanding of the requirement from database applications is that they want to create a large shared memory region
for thousands of processes. This region may or may not be file-backed. One of the processes can be a control process
that serves as gatekeeper to various parts of this shared region. The gatekeeper can open up write access to a part of
the shared region (which can span thousands of pages), populate or update data, and then close down write access to
that part again. Any other process that tries to write to this region in the meantime gets a signal and can choose to
handle it or simply be killed. All the gatekeeper wants is to close access to the shared region at any time without
having to coordinate that with thousands of processes, and to let the other processes deal with access having been
revoked.

With this requirement, what database applications have found effective is to use mprotect() to apply protection to the
relevant part of the shared region and have it propagate to everyone attempting to access that region. With the
currently available mechanisms, that means sending a message to every process so it can apply the same mprotect() bits
to its own PTEs and honor the gatekeeper's request. With shared PTEs, opted into explicitly, the protection bits change
for all processes at the same time, with no additional action required from those thousands of processes. That helps
performance very significantly.
The second big win is the memory saved that would otherwise be consumed by PTEs in every process. The memory saved this
way literally takes a system from being completely infeasible to having room to spare (referring to the case I
described in my original mail, where storing the PTEs would have required more memory than was installed on the
system).
>> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
>> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
>> same time. So the two requirements of this feature are not separable.
>
> Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> the current solution really requires sharing of page tables, which I absolutely don't like.
>
> It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> with weird VMA semantics.
We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
multi-page memory region by all processes sharing it, and to do so instantly and efficiently by letting the gatekeeper
close the gates and be done with it.
Thanks,
Khalid