From: David Hildenbrand <david@redhat.com>
To: Khalid Aziz <khalid.aziz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Kravetz <mike.kravetz@oracle.com>
Cc: willy@infradead.org, aneesh.kumar@linux.ibm.com, arnd@arndb.de,
21cnbao@gmail.com, corbet@lwn.net, dave.hansen@linux.intel.com,
ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz,
keescook@chromium.org, kirill@shutemov.name, kucharsk@gmail.com,
linkinjeon@kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
longpeng2@huawei.com, luto@kernel.org, markhemm@googlemail.com,
pcc@google.com, rppt@kernel.org, sieberf@amazon.com,
sjpark@amazon.de, surenb@google.com, tst@schoebel-theuer.de,
yzaikin@google.com
Subject: Re: [PATCH v2 0/9] Add support for shared PTEs across processes
Date: Mon, 18 Jul 2022 14:59:46 +0200 [thread overview]
Message-ID: <1d3cdad0-b3a7-ec25-1652-efa7c39d1705@redhat.com> (raw)
In-Reply-To: <bca034e9-5218-5ae4-79df-8c40e0aa6d3d@oracle.com>
[sorry for not being as responsive as I usually am]
>>
>> They share a *mm* including a consistent virtual memory layout (VMA
>> list). Page table sharing is just a side product of that. You could even
>> call page tables just an implementation detail to produce that
>> consistent virtual memory layout -- described for that MM via a
>> different data structure.
>
> Yes, sharing an mm and vma chain does make it different from an implementation point of view.
>
>>
>>> A number of people have commented on potential usefulness of this concept
>>> and implementation.
>>
>> ... and a lot of people raised concerns. Yes, page table sharing to
>> reduce memory consumption/tlb misses/... is something reasonable to
>> have. But that doesn't require mshare, as hugetlb has proven.
>>
>> The design might be useful for a handful of corner (!) cases, but as the
>> cover letter only talks about memory consumption of page tables, I'll
>> not care about those. Once these corner cases are explained and deemed
>> important, we might want to think of possible alternatives to explore
>> the solution space.
>
> Memory consumption by page tables is turning out to be a significant issue. I mentioned one real-world example from a
> customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into
> their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for
> everyone.
Yes. Another use case I am aware of are KVM-based virtual machines, when
VM memory (shmem, file-backed) is not only mapped into the emulator
process, but also into other processes used to carry out I/O (e.g.,
vhost-user).
In that case, it's tempting to simply share the page tables between all
processes for the shared mapping -- automatically, just like
shmem/hugetlb already does.
[...]
>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested into what the major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
>>
>
> I see two parts to what you are suggesting (please correct me if I get this wrong):
>
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace
Yes. Whereby 2) would usually just be some heuristic (e.g., file > X
MiB -> start sharing), with an additional way to explicitly disable or
enable it. But yes, most of that stuff should just be automatic.
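Such a heuristic with a per-process/per-VMA override could look roughly like the following userspace sketch. The threshold value, the enum, and the function name are all hypothetical, purely for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define SHARE_THRESHOLD (16ULL << 20)   /* hypothetical cutoff: 16 MiB */

/* Hypothetical per-process/per-VMA override knob. */
enum share_override { SHARE_AUTO, SHARE_FORCE_ON, SHARE_FORCE_OFF };

/*
 * Heuristic: share page tables for large shared file mappings by
 * default, but let an explicit override win either way.
 */
static bool want_pt_sharing(uint64_t mapping_size, enum share_override ov)
{
    if (ov == SHARE_FORCE_ON)
        return true;
    if (ov == SHARE_FORCE_OFF)
        return false;
    return mapping_size >= SHARE_THRESHOLD;
}
```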
>
> For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is
> to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from
> this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing
> code and recasting it into this framework of special mm struct. I will look some more into it.
The basic idea would be that whenever a MAP_SHARED VMA has a reasonable
size, is aligned in a suitable way (including MAP offset), and
protections match, you can just share PTE tables and even PMD tables. As
page tables of shared mappings usually don't really store per-process
information (exceptions I am aware of are userfaultfd and softdirty
tracking), we can simply share/unshare page tables of shared mappings
fairly easily.
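The alignment requirement above can be sketched in plain C, in the spirit of hugetlb's PMD-sharing eligibility check. This is a userspace illustration assuming 4 KiB pages and 2 MiB PMD coverage (x86-64-like); `range_shareable` is a hypothetical name, not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21                       /* 2 MiB with 4 KiB pages */
#define PMD_SIZE   (1ULL << PMD_SHIFT)
#define PMD_MASK   (~(PMD_SIZE - 1))

/*
 * A PTE table can only be shared for a range that covers a whole
 * PMD-sized, PMD-aligned block, and only if the file offset is
 * congruent with the virtual address modulo PMD_SIZE -- otherwise
 * the same table would map different file pages in each process.
 */
static bool range_shareable(uint64_t vm_start, uint64_t vm_end,
                            uint64_t file_off)
{
    uint64_t start = (vm_start + PMD_SIZE - 1) & PMD_MASK; /* round up   */
    uint64_t end   = vm_end & PMD_MASK;                    /* round down */

    if (start >= end)
        return false;           /* no full PMD block inside the range */
    /* virtual address and file offset must be aligned the same way */
    return ((vm_start ^ file_off) & (PMD_SIZE - 1)) == 0;
}
```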
Then, you'd have e.g., 2 sets of page tables cached by the fd that can
be mapped into processes:
1) PROT_READ|PROT_WRITE
2) PROT_READ
On VMA protection changes, one would have to unshare (unmap the page
table) and either map another shared one, or map a private one. I don't
think there would be need to optimize e.g., for PROT_NONE, but of
course, other combinations could make sense to cache.
PROT_NONE and other corner cases (softdirty tracking) would simply not
use shared page tables.
Shared page tables would have to be refcounted and one could e.g.,
implement a shrinker that frees unused page tables in the fd cache when
memory reclaim kicks in.
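The refcounting side of that could be as simple as the following sketch. Again a userspace illustration with hypothetical names; in the kernel this would hang off the fd's cache and interact with an actual shrinker:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One cached set of shared page tables for an fd, keyed by protection. */
struct shared_pt {
    _Atomic int refcount;   /* one reference per process using the tables */
    int prot;               /* e.g. PROT_READ | PROT_WRITE */
};

static void shared_pt_init(struct shared_pt *pt, int prot)
{
    atomic_store(&pt->refcount, 1);
    pt->prot = prot;
}

static void shared_pt_get(struct shared_pt *pt)
{
    atomic_fetch_add(&pt->refcount, 1);
}

/*
 * Returns true when the last user dropped its reference; at that point
 * the tables become eligible for a shrinker to free under reclaim.
 */
static bool shared_pt_put(struct shared_pt *pt)
{
    return atomic_fetch_sub(&pt->refcount, 1) == 1;
}
```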
With something like that in place, page table consumption could be
reduced and vmscan/rmap walks could turn out more efficient.
>
> As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file
> based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal
> for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you
> can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction.
We can glue it to the file or anything else that's shared, I think -- I
don't think we particularly care, as long as it's something shared between
processes to be mapped. And to be quite honest, whenever I read about
anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice screaming:
just use *shared* memory when you want to *share* memory between
processes, and optimize that if anything is missing.
That said, I understood from previous discussions that there is a
use case of efficient read-only protection across many processes/VMAs. I
was wondering if that could be handled on the fs-level (pte_mkwrite). I
remember I raised the idea before: if one could have a
userfaultfd-wp-style (overlay?) file (system?), user-space could
protect/unprotect file pages via a different mechanism (ioctl) and get
notified about write access via something similar to userfaultfd
user-space handlers, not via signals. Instead of adjusting VMAs, one
would only adjust file page mappings to map the relevant pages R/O when
protecting -- if page tables are shared, that would be efficient.
Now, that is just a very vague brain dump to get it out of my
(overloaded) system. What I think the overall message is: let's try not
to design new features around page table sharing; let's use page table
sharing as an rmap performance optimization and as a mechanism to reduce
page table overhead. I hope what I said makes sense; I might just be
wrong.
--
Thanks,
David / dhildenb