From: Matthew Wilcox <willy@infradead.org>
Date: Mon, 31 Jul 2023 13:25:02 +0100
To: Rongwei Wang
Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, "xuyu@linux.alibaba.com", David Hildenbrand
Subject: Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
In-Reply-To: <74fe50d9-9be9-cc97-e550-3ca30aebfd13@linux.alibaba.com>

On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
> Hi Matthew,
>
> May I ask you another question about mshare under this RFC? I remember
> you said you would redesign mshare to be per-VMA rather than per-mapping
> (apologies if I remember wrongly) in the last MM alignment session. I
> have also referred to that when re-coding this part in our internal
> version (based on this RFC). It seems that a per-VMA approach can
> simplify the structure of pgtable sharing and even avoids having to care
> about differing permissions on the file mapping; those are the
> advantages (maybe) that I can imagine. But IMHO, that does not seem like
> a strong enough reason to switch from per-mapping to per-VMA.
>
> I also can't guess what other considerations upstream may have. Can you
> share the reason for redesigning it in a per-VMA way? Is it due to
> integration with hugetlbfs pgtable sharing or anonymous page sharing?

It was David who wants page table sharing to be per-VMA. I think he is
advocating for the wrong approach. In any case, I don't have time to work
on mshare and Khalid is on leave until September, so I don't think anybody
is actively working on mshare.

> Thanks for your time.
>
> On 2023/4/27 00:49, Khalid Aziz wrote:
> > Memory pages shared between processes require a page table entry
> > (PTE) for each process. Each of these PTEs consumes some memory and,
> > as long as the number of mappings being maintained is small enough,
> > the space consumed by page tables is not objectionable.
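As a side note, the per-process page table cost described above is easy to
observe on a running system today: the kernel already reports it as the
VmPTE field of /proc/<pid>/status. A minimal C sketch (nothing here depends
on the patches in this series):

/* Print this process's page table footprint (VmPTE) from /proc/self/status. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "VmPTE:", 6))
                        fputs(line, stdout);    /* e.g. "VmPTE:      52 kB" */
        }
        fclose(f);
        return 0;
}

Running this inside each of N processes that map the same large region shows
the duplicated page table memory the series is trying to eliminate.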
> >
> > When very few memory pages are shared between processes, the number
> > of page table entries (PTEs) to maintain is mostly constrained by
> > the number of pages of memory on the system. As the number of
> > shared pages and the number of times pages are shared goes up, the
> > amount of memory consumed by page tables starts to become
> > significant. This issue does not apply to threads. Any number of
> > threads can share the same pages inside a process while sharing the
> > same PTEs. Extending this same model to sharing pages across
> > processes can eliminate the issue for cross-process sharing as well.
> >
> > Some field deployments commonly see memory pages shared across
> > thousands of processes. On x86_64, each page requires a PTE that is
> > only 8 bytes long, which is very small compared to the 4K page size.
> > When 2000 processes map the same page in their address space, each
> > one of them requires 8 bytes for its PTE, and together that adds up
> > to 16K of memory just to hold the PTEs for one 4K page. On a
> > database server with a 300GB SGA, a system crash with an
> > out-of-memory condition was seen when 1500+ clients tried to share
> > this SGA even though the system had 512GB of memory. On this server,
> > the worst case scenario of all 1500 processes mapping every page
> > from the SGA would have required 878GB+ just for the PTEs. If these
> > PTEs could be shared, the amount of memory saved would be very
> > significant.
> >
> > This patch series adds a new flag to the mmap() call - MAP_SHARED_PT.
> > This flag can be specified along with MAP_SHARED by a process to
> > hint to the kernel that it wishes to share page table entries for
> > this file mapping mmap region with other processes. Any other
> > process that mmaps the same file with the MAP_SHARED_PT flag can
> > then share the same page table entries. Besides specifying the
> > MAP_SHARED_PT flag, the processes must map the files at a
> > PMD-aligned address, with a size that is a multiple of the PMD size,
> > and at the same virtual addresses. This last requirement of same
> > virtual addresses can possibly be relaxed if that is the consensus.
> >
> > When mmap() is called with the MAP_SHARED_PT flag, a new host mm
> > struct is created to hold the shared page tables. The host mm struct
> > is not attached to a process. The start and size of the host mm are
> > set to the start and size of the mmap region, and a VMA covering
> > this range is also added to the host mm struct. Existing page table
> > entries from the process that creates the mapping are copied over to
> > the host mm struct. All processes mapping this shared region are
> > considered guest processes. When a guest process mmaps the shared
> > region, a vm flag VM_SHARED_PT is added to the VMAs in the guest
> > process. Upon a page fault, the VMA is checked for the presence of
> > the VM_SHARED_PT flag. If the flag is found, its corresponding PMD
> > is updated with the PMD from the host mm struct so that the PMD
> > points to the page tables in the host mm struct. The vm_mm pointer
> > of the VMA is also updated to point to the host mm struct for the
> > duration of fault handling to ensure fault handling happens in the
> > context of the host mm struct. When a new PTE is created, it is
> > created in the host mm struct's page tables and the PMD in the guest
> > mm points to the same PTEs.
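To make the proposed interface above concrete, here is roughly what a
cooperating process would do under this proposal. This is only a sketch:
it needs the uapi headers patched by this series for MAP_SHARED_PT, and the
file name, fixed address and mapping size below are arbitrary illustrative
choices (the 2MB PMD size assumes x86_64 with 4K pages):

/*
 * Hypothetical userspace use of the proposed MAP_SHARED_PT flag.  Builds
 * only against the uapi headers from this series; path, address and size
 * are illustrative assumptions, not requirements of the series beyond
 * PMD alignment, PMD-multiple length and a common virtual address.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT   /* provided by the patched uapi headers in this series */
#error "build against the headers from the MAP_SHARED_PT series"
#endif

#define PMD_SZ  (2UL << 20)     /* 2MB PMD: x86_64 with 4K pages */

int main(void)
{
        size_t len = 4 * PMD_SZ;                /* multiple of the PMD size */
        void *hint = (void *)0x7f4000000000UL;  /* PMD aligned; same VA in every sharer */
        int fd = open("/dev/shm/ptshare-demo", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len)) {
                perror("setup");
                return 1;
        }

        /* MAP_FIXED_NOREPLACE fails rather than clobbering an existing mapping. */
        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED_NOREPLACE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("sharing page tables for %zu bytes at %p\n", len, p);
        return 0;
}

A second process that wants to share the page tables would perform the same
mmap() against the same file, at the same address and length.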
> >
> > This is a basic working implementation. It will need to go through
> > more testing and refinements. Some notes and questions:
> >
> > - PMD size alignment and size requirement is currently hard coded
> >   in. Is there a need or desire to make this more flexible and work
> >   with other alignments/sizes? The PMD size allows for adapting this
> >   infrastructure to form the basis for hugetlbfs page table sharing
> >   as well. More work will be needed to make that happen.
> >
> > - Is there a reason to allow a userspace app to query this size and
> >   alignment requirement for MAP_SHARED_PT in some way?
> >
> > - Shared PTEs mean that an mprotect() call made by one process
> >   affects all processes sharing the same mapping, and that behavior
> >   will need to be documented clearly [see the sketch further below].
> >   The effect of an mprotect call being different for processes using
> >   shared page tables is the primary reason to require an explicit
> >   opt-in from userspace processes to share page tables. With
> >   transparent sharing derived from MAP_SHARED alone, the changed
> >   effect of mprotect could break a significant number of userspace
> >   apps. One could work around that by unsharing whenever mprotect
> >   changes modes on a shared mapping, but that introduces complexity,
> >   and the capability to execute a single mprotect to change modes
> >   across thousands of processes sharing a mapped database is a
> >   feature explicitly asked for by database folks. This capability
> >   has a significant performance advantage compared to sending
> >   messages to every process using the shared mapping so that each
> >   one calls mprotect and changes modes itself, or compared to using
> >   traps on permission mismatches in each process.
> >
> > - This implementation does not allow partially unmapping a mapping
> >   with shared page tables. Should that be supported in the future?
> >
> > Some concerns in this RFC:
> >
> > - When page tables for a process are freed upon process exit,
> >   pmd_free_tlb() gets called at one point to free all PMDs allocated
> >   by the process. For a shared page table, shared PMDs cannot be
> >   released when a guest process exits. These shared PMDs are
> >   released when the host mm struct is released, upon the end of the
> >   last reference to the page table shared region hosted by this mm.
> >   For now, to stop the PMDs from being released, this RFC introduces
> >   the following change in mm/memory.c, which works but does not feel
> >   like the right approach. Any suggestions for a better long-term
> >   approach will be much appreciated:
> >
> > @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> >
> >         pmd = pmd_offset(pud, start);
> >         pud_clear(pud);
> > -       pmd_free_tlb(tlb, pmd, start);
> > -       mm_dec_nr_pmds(tlb->mm);
> > +       if (shared_pte) {
> > +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
> > +               tlb->freed_tables = 1;
> > +       } else {
> > +               pmd_free_tlb(tlb, pmd, start);
> > +               mm_dec_nr_pmds(tlb->mm);
> > +       }
> >  }
> >
> >  static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> >
> > - This implementation requires an additional VM flag. Since all lower
> >   32 bits are currently in use, the new VM flag must come from the
> >   upper 32 bits, which restricts this feature to 64-bit processors.
> >
> > - This feature is implemented for file mappings only. Is there a
> >   need to support it for anonymous memory as well?
> >
> > - Accounting of MAP_SHARED_PT mapped file pages and page table bytes
> >   in a process is not quite accurate yet in this RFC and will be
> >   fixed in the non-RFC version of the patches.
> >
> > I appreciate any feedback on these patches and ideas for
> > improvements before moving these patches out of the RFC stage.
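On the mprotect semantics flagged in the notes above, this is roughly what
the opted-in behaviour would look like from userspace. Again purely a
sketch, reusing the same illustrative file, address and size as the earlier
example and requiring this series' headers for MAP_SHARED_PT. With plain
MAP_SHARED the mprotect() below would only affect the calling process; with
shared page tables, the cover letter says every sharer is affected:

/*
 * Second ("guest") process: attach to the mapping created in the earlier
 * sketch, then drop write permission.  With MAP_SHARED_PT the PTEs live
 * in the shared (host) page tables, so this single call is expected to
 * make the range read-only for every process sharing them.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_PT   /* provided by the patched uapi headers in this series */
#error "build against the headers from the MAP_SHARED_PT series"
#endif

#define PMD_SZ  (2UL << 20)     /* 2MB PMD: x86_64 with 4K pages */

int main(void)
{
        size_t len = 4 * PMD_SZ;
        void *hint = (void *)0x7f4000000000UL;  /* same VA as the first process */
        int fd = open("/dev/shm/ptshare-demo", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SHARED_PT | MAP_FIXED_NOREPLACE, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* One call, all sharers: the shared PTEs become read-only. */
        if (mprotect(p, len, PROT_READ)) {
                perror("mprotect");
                return 1;
        }
        printf("%p..%p now read-only for every sharing process\n",
               p, (void *)((char *)p + len));
        return 0;
}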
> >
> >
> > Changes from RFC v1:
> > - Broke the patches up into smaller patches
> > - Fixed a few bugs related to freeing PTEs and PMDs incorrectly
> > - Cleaned up the code a bit
> >
> >
> > Khalid Aziz (4):
> >   mm/ptshare: Add vm flag for shared PTE
> >   mm/ptshare: Add flag MAP_SHARED_PT to mmap()
> >   mm/ptshare: Create new mm struct for page table sharing
> >   mm/ptshare: Add page fault handling for page table shared regions
> >
> >  include/linux/fs.h                     |   2 +
> >  include/linux/mm.h                     |   8 +
> >  include/trace/events/mmflags.h         |   3 +-
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/Makefile                            |   2 +-
> >  mm/internal.h                          |  21 ++
> >  mm/memory.c                            | 105 ++++++++--
> >  mm/mmap.c                              |  88 +++++++++
> >  mm/ptshare.c                           | 263 +++++++++++++++++++++++++
> >  9 files changed, 476 insertions(+), 17 deletions(-)
> >  create mode 100644 mm/ptshare.c
> >
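As a closing sanity check on the figures quoted in the cover letter, the
arithmetic works out as follows; the only assumptions are the 4K page size
and 8-byte PTEs on x86_64 stated above:

/* Reproduce the PTE-overhead arithmetic from the cover letter. */
#include <stdio.h>

int main(void)
{
        const unsigned long long page_sz = 4096;        /* 4K pages */
        const unsigned long long pte_sz  = 8;           /* bytes per PTE on x86_64 */

        /* One 4K page mapped by 2000 processes. */
        printf("PTE memory for one page in 2000 processes: %llu bytes (~16K)\n",
               2000 * pte_sz);

        /* A 300GB SGA mapped in full by each of 1500 processes. */
        unsigned long long sga = 300ULL << 30;
        unsigned long long per_proc = sga / page_sz * pte_sz;
        printf("PTE memory per process for a 300GB SGA: %llu MB\n",
               per_proc >> 20);
        printf("PTE memory for 1500 such processes: %llu GB\n",
               (per_proc * 1500) >> 30);
        return 0;
}

This prints roughly 600 MB per process and 878 GB for 1500 processes, which
matches the 878GB+ worst case cited in the cover letter.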