Re: [RFC PATCH 0/6] Add support for shared PTEs across processes

From: Mark Hemment <markhemm@googlemail.com>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	longpeng2@huawei.com, arnd@arndb.de,
	 dave.hansen@linux.intel.com, david@redhat.com, rppt@kernel.org,
	 Suren Baghdasaryan <surenb@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
Date: Wed, 19 Jan 2022 11:38:37 +0000
Message-ID: <CANe_+UhrcQO0mwB2KJtm69+MDVJuD-r5Cbyqt8AkWdtfY55CCQ@mail.gmail.com>
In-Reply-To: <cover.1642526745.git.khalid.aziz@oracle.com>

On Tue, 18 Jan 2022 at 21:20, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to about 16K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
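
As a rough check of the figures quoted above, the PTE cost can be computed
directly. This is only a sketch: pte_bytes() is a hypothetical helper, not
part of the patch set, and it counts last-level PTEs only (upper-level page
tables are ignored). For a 300GB region, 4K pages, 8-byte PTEs and 1500
processes it gives 943718400000 bytes, a little under 879GB, which is
consistent with the "878GB+" estimate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of last-level PTEs needed when nprocs processes each map
 * region_bytes of memory through their own private page tables.
 * Hypothetical helper for illustration only.
 */
uint64_t pte_bytes(uint64_t region_bytes, uint64_t page_size,
                   uint64_t pte_size, uint64_t nprocs)
{
        assert(page_size > 0);
        /* one PTE per page, per process */
        return (region_bytes / page_size) * pte_size * nprocs;
}
```

With shared PTEs the nprocs factor would drop out for the shared range, so
the same call with nprocs = 1 gives the approximate shared cost.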
>
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mshare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>
>
> API Proposal
> ============
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags is a bitwise OR of the following:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
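
The alignment and size rules above could be checked in userspace before
making the call. A minimal sketch, assuming the 512GB pgdir span stated for
x86_64; PGDIR_SPAN and mshare_range_ok() are hypothetical names, and
rejecting a zero length is an assumption rather than something the proposal
states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed pgdir span on x86_64 as stated in the proposal: 512GB. */
#define PGDIR_SPAN (1ULL << 39)

/*
 * Returns true when addr is pgdir-aligned and length is a non-zero
 * multiple of the pgdir span. Hypothetical helper for illustration.
 */
bool mshare_range_ok(uint64_t addr, uint64_t length)
{
        return length != 0 &&
               (addr & (PGDIR_SPAN - 1)) == 0 &&
               (length & (PGDIR_SPAN - 1)) == 0;
}
```

For example, the 2TB start address and 512GB length used in the example
code pass this check, while a 1GB length does not.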
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the named shared object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         /* Map a pgdir-aligned 512GB anonymous range at 2TB. */
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>         if (addr == MAP_FAILED) {
>                 perror("ERROR: mmap failed");
>                 exit(1);
>         }
>
>         /* Create the named mshare'd region; the mode is octal 0600. */
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
>                 printf("INFO: %ld bytes shared at addr %lx \n",
>                                 mshare_info[1], mshare_info[0]);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (char *)mshare_info[0];
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
>                         mshare_info[1], O_RDWR, 0600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %p:\n", (void *)addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------
...
Hi Khalid,

The proposed mshare() appears to be similar to POSIX shared memory,
but with two extra (related) attributes:
a) Internally, uses shared page tables.
b) Shared memory is mapped at same address for all users.

Rather than introduce two new system calls, along with /sys/ file to
communicate global addresses, could mshare() be built on top of shmem
API?  Thinking of something like the below:
1) For shm_open(3), add a new oflag to indicate the properties needed
for mshare() (say, O_SHARED_PTE - better name?)
2) For ftruncate(2), objects created with O_SHARED_PTE are constrained
in the sizes which can be set.
3) For mmap(2), NULL is always passed as the address for O_SHARED_PTE
objects.  On first mmap()ing an appropriate address is assigned,
otherwise the current 'global' address is used.
4) shm_unlink(3) destroys the object when last reference is dropped.

For 3), might be able to weaken the NULL requirement and validate a
given address on first mapping to ensure it is correctly aligned.
shm_open(3) sets FD_CLOEXEC on the file descriptor, which might not be
the default behaviour you require.
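
From userspace, steps 1) to 4) might look roughly like the sketch below.
O_SHARED_PTE is hypothetical and does not exist in any kernel; it is
defined as 0 here so that the code builds and behaves as plain POSIX shared
memory. shared_pte_demo() and the small size used for testing are likewise
assumptions; a real O_SHARED_PTE object would presumably need a
pgdir-sized length.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* HYPOTHETICAL flag; defined as 0 so this runs as ordinary POSIX shm. */
#define O_SHARED_PTE 0

/*
 * Create or open a named object, size it, map it with a NULL address,
 * write a string and read it back. Returns 0 on success, -1 on failure.
 */
int shared_pte_demo(const char *name, size_t size)
{
        const char msg[] = "Some random shared text";
        int ret = -1;
        char *addr;
        int fd;

        if (size < sizeof(msg))
                return -1;

        /* 1) open with the (hypothetical) extra oflag */
        fd = shm_open(name, O_CREAT | O_RDWR | O_SHARED_PTE, 0600);
        if (fd < 0)
                return -1;

        /* 2) size constraints would be enforced by ftruncate() */
        if (ftruncate(fd, (off_t)size) < 0)
                goto out;

        /* 3) NULL address: the kernel assigns (or reuses) the address */
        addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                goto out;

        memcpy(addr, msg, sizeof(msg));
        ret = strcmp(addr, msg) == 0 ? 0 : -1;
        munmap(addr, size);
out:
        close(fd);
        /* 4) remove the name; the object goes with the last reference */
        shm_unlink(name);
        return ret;
}
```

On older glibc versions this needs to be linked with -lrt for shm_open()
and shm_unlink().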

Internally, the handling of mshare()/O_SHARED_PTE memory might be
sufficiently different to shmem that there is not much code sharing
between the two (I haven't thought this through, but the object
naming/refcounting should be similar), but using shmem would be a
familiar API.

Any thoughts?

Cheers,
Mark
