From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 91D9BCFB43F for ; Mon, 7 Oct 2024 08:41:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 290A76B00EE; Mon, 7 Oct 2024 04:41:49 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 219AE6B00EF; Mon, 7 Oct 2024 04:41:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 06DED6B00F0; Mon, 7 Oct 2024 04:41:48 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id D09BE6B00EE for ; Mon, 7 Oct 2024 04:41:48 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 6C062AC02D for ; Mon, 7 Oct 2024 08:41:48 +0000 (UTC) X-FDA: 82646163096.16.42D90B7 Received: from fhigh-a7-smtp.messagingengine.com (fhigh-a7-smtp.messagingengine.com [103.168.172.158]) by imf05.hostedemail.com (Postfix) with ESMTP id 62F0010000A for ; Mon, 7 Oct 2024 08:41:45 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=shutemov.name header.s=fm3 header.b="k zubhdt"; dkim=pass header.d=messagingengine.com header.s=fm2 header.b=MyWCzTNa; spf=pass (imf05.hostedemail.com: domain of kirill@shutemov.name designates 103.168.172.158 as permitted sender) smtp.mailfrom=kirill@shutemov.name; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1728290371; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=HW1/nAnoZg+IChobN7xn2IJtM/YKr7Y7ke8CNTMEY+M=; b=oySWjcbW9DJQGNTt7vvecZuB15YNr/GDllTffrI1dMrXa44bd07cIdATHO080zfR5uSL8A JoJxAOpwmPVG534se69XrAbmA7Lw06lrVFk+qO/memWUzkh+XsF0AypAM6nPIXpTPNI2Ho eltHuSaW99KtP3/1Zv/CAfb4g37Bzcg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1728290371; a=rsa-sha256; cv=none; b=aYnoJYyJ32SuFe4YqRV7y5RsFlofl+zzTzZdY33U7heYGwL7OWhBDrF88qA+635ilUZrR7 NDI23F9uREvBYhYKlhUdmzdT/+Jrq4T9icXp/zzm4kNMUT3bW/+lpq+45w9cOpQrWtfx8w l2mPQW+/F23MQopqLF9hmUFPG5wZgo8= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=shutemov.name header.s=fm3 header.b="k zubhdt"; dkim=pass header.d=messagingengine.com header.s=fm2 header.b=MyWCzTNa; spf=pass (imf05.hostedemail.com: domain of kirill@shutemov.name designates 103.168.172.158 as permitted sender) smtp.mailfrom=kirill@shutemov.name; dmarc=none Received: from phl-compute-12.internal (phl-compute-12.phl.internal [10.202.2.52]) by mailfhigh.phl.internal (Postfix) with ESMTP id A45FF11400AB; Mon, 7 Oct 2024 04:41:44 -0400 (EDT) Received: from phl-mailfrontend-01 ([10.202.2.162]) by phl-compute-12.internal (MEProxy); Mon, 07 Oct 2024 04:41:44 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shutemov.name; h=cc:cc:content-type:content-type:date:date:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:subject:subject:to:to; s=fm3; t=1728290504; x= 1728376904; bh=HW1/nAnoZg+IChobN7xn2IJtM/YKr7Y7ke8CNTMEY+M=; b=k zubhdtVs6m2be1vjrm0Iv4AhLP3u/bU2Hds2JxoESNdhvf0Ff1MBZBI9HTzLxlth tPtlGxQ9M98pZakjDbelKKRPmdv9HGU454yYxntDhuhOKwlszSOwXJ7Vuz7r3f4I Hq5k1E9NFNHubpu3R421HBQhNg1PWMO2SVZyKnkCEdwvQSmVoC1BbYqpn7CRZzR1 YRRv4KbVfIqCf09DMzrjWZe9CkNrnlxSg3V8g55NOA6mVZhA3CNP8A44MsIcRbhl M49yGzNljbrDa+9q3F/h5x2RX+z1GFKO97J2qdg7rl7ox3HirQ+IflLLReFd0QUd FKkmBq0ae+yiVh3k0Ya6g== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm2; t=1728290504; x=1728376904; bh=HW1/nAnoZg+IChobN7xn2IJtM/YK r7Y7ke8CNTMEY+M=; b=MyWCzTNatMPWz/TlQmyjHa/b2i/wB9c2FdNz3+ISrDuH m2gh0ywkKDXCqPQM0ZJbCisZ8GWfpGzPzyK7TrLyKZxZCJFthIIS03dWQXH3OHWZ 2PIwubmfNuqY/ezi+h7QBq5SvnOyfuIL9Jm7mi67eGLBXUw6lXjebi3V4K/YAP7v uH2WygXalGmapiTTHOAsRLMWzq3QkbxZRGAdcgIzbxWcPg2gqWvYVlf1EQEFIRsG KJT2hxGUc5+Tc6KMKDsSMKANxQSpZt3upjj4MdT4EcbJfGgNtKE8+r79drcGCIey Q/aujLiMkxwE3jjdojsMaGak5JfK1QwkFFlhIXmpsg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeftddrvddvledgtdejucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdggtfgfnhhsuhgsshgtrhhisggvpdfu rfetoffkrfgpnffqhgenuceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnh htshculddquddttddmnecujfgurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvden ucfhrhhomhepfdfmihhrihhllhcutedrucfuhhhuthgvmhhovhdfuceokhhirhhilhhlse hshhhuthgvmhhovhdrnhgrmhgvqeenucggtffrrghtthgvrhhnpeffvdevueetudfhhfff veelhfetfeevveekleevjeduudevvdduvdelteduvefhkeenucevlhhushhtvghrufhiii gvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpehkihhrihhllhesshhhuhhtvghmohhv rdhnrghmvgdpnhgspghrtghpthhtohepvdegpdhmohguvgepshhmthhpohhuthdprhgtph htthhopegrnhhthhhonhihrdihiihnrghgrgesohhrrggtlhgvrdgtohhmpdhrtghpthht oheprghkphhmsehlihhnuhigqdhfohhunhgurghtihhonhdrohhrghdprhgtphhtthhope ifihhllhihsehinhhfrhgruggvrggurdhorhhgpdhrtghpthhtohepmhgrrhhkhhgvmhhm sehgohhoghhlvghmrghilhdrtghomhdprhgtphhtthhopehvihhrohesiigvnhhivhdrlh hinhhugidrohhrghdruhhkpdhrtghpthhtohepuggrvhhiugesrhgvughhrghtrdgtohhm pdhrtghpthhtohepkhhhrghlihgusehkvghrnhgvlhdrohhrghdprhgtphhtthhopegrnh gurhgvhihknhhvlhesghhmrghilhdrtghomhdprhgtphhtthhopegurghvvgdrhhgrnhhs vghnsehinhhtvghlrdgtohhm X-ME-Proxy: Feedback-ID: ie3994620:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 7 Oct 2024 04:41:35 -0400 (EDT) Date: Mon, 7 Oct 2024 11:41:30 +0300 From: "Kirill A. Shutemov" To: Anthony Yznaga Cc: akpm@linux-foundation.org, willy@infradead.org, markhemm@googlemail.com, viro@zeniv.linux.org.uk, david@redhat.com, khalid@kernel.org, andreyknvl@gmail.com, dave.hansen@intel.com, luto@kernel.org, brauner@kernel.org, arnd@arndb.de, ebiederm@xmission.com, catalin.marinas@arm.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhiramat@kernel.org, rostedt@goodmis.org, vasily.averin@linux.dev, xhao@linux.alibaba.com, pcc@google.com, neilb@suse.de, maz@kernel.org Subject: Re: [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support Message-ID: References: <20240903232241.43995-1-anthony.yznaga@oracle.com> <20240903232241.43995-9-anthony.yznaga@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20240903232241.43995-9-anthony.yznaga@oracle.com> X-Rspam-User: X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: 62F0010000A X-Stat-Signature: 1xj5mxnmyhx7ujn4x75b65qexbp57ejh X-HE-Tag: 1728290505-890375 X-HE-Meta: U2FsdGVkX195EwLCTrPG4xaGTR6+usdberXJuOjmrVjKuDJjIRtbpJ9vC+VqdaxP1g2tZFAXr6+2CpYNZ1JoN6riXWi1uVUtuUJU+C8l93uaqu6B7eBPypizAPLDrqOCU52FZXt1kMOerJ9emcLeo0bMN9D6b4A3gOxP0S5Mge19CvudIBygPkqHTUrirmL1MeNjnANeaFN6m1x4pTIXuNvMVEk7kc9+bMzkrR6IsLfHBnLx7C/luA9fZAVRR7tw/kpS+X668CKu1rIhBfDhf/1Yl+HhE+sukD8YHb26+qDHdI/kb97ITYjtivbxWB/rdQ5oIo2bc6iBJ92pQ5Ha1nW0RS1STZEPRG/5INYgmPIzn6zYauQGrL5QW14dABOhApPo0tH04KM2LYKcZgTDFLrtj+q2ouc8xmKL4ncahNQU7D/cHDIpdkbwUB3MVxUCEiT3BOdetIHzy1Xx0+etq4IXUnLWfuZWoR89nggz8QC4zW7W1j0vfCALbYgntIOByWRBIArY4vFRU/1N/dS+XoLIBbol4H7hH6//prOvOjJf2Mw8x+3diauVTSAnJZxY7sL5NmDDJXs8tY4T6oHBotOtkE+lJ/8IyLEkm07pWs9Mm1rKEFrTI+16YMLUmcqOmgNdrx1VVrLA93SLTv0+G58FTG0APkzsyQ8Gkd4EVat439oRS2tL4SH6OD6S2SoIlzFnBLrPGrzrEAhsd2F/R74XHs57ywqShsoQk2IDyGA5b4gfDCOh+OZPd+uaywb5sY5pe5ZW4JUlw678X4e7mfxhbYYT1EzCeCP8M8WPOOIS/IGerzEWd8JfA3ELSOZ/CnMz9bR6B875lmX7S+pFUUPVr1WMo95vnng0e1b5Og5lvLG5XVrxS9oiFBeRtrAYWfF0VVdbXpFVphZTZ8eVAiJqfeKPOi/r3xuMKoXiIzKOZ0VaiijFS1y5jwAJGlVwFqX1du8oYsXXRsiIvZm 1q6JB+1w yV4B8lSnCluG4q0ocADlsCluCTqslAvkd+HwfrimD1/W5VRY1l3YV0E6DFBxTgC9amq2Xpb4UoP0ALs1j8tayYbbxGP0W5iOL8rvGIKQIL5IqXOjzEU8OJfJOds8VcPVEFuprZ9SjSoZC2IuesYWvN/gp2YD1x5IOqBiOott8xKOU7MdG80THt9gwMdo+8uAZn6fPrSC6ZAJuGzJkmMgpKDXjHOYdBxGN4IIZ0csa1uNmJ1DrhzFbBCGTM/UCTY7CQlwVzVJuEEBnQXDFlID1hRRI8ueR/p9Hhy/fSnaCAL8xSjt8uh6Eqer4vIxrdYE8SJ5sVqxphYDPZyvVyg/HZ/R6uFluZQlTQ9I73x4MaRiF+brT7uz5CpU4nRyHLgMwe++t8WZ3kvfDjD1kgHZH1c3Zwp9EzbutQR73W4gFmQ8vBI3vgc4pzAEKBmirfvAt9+Xc1p4qgcLw2NIvi/FpBOcTIsWCJKG4r3HO X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Sep 03, 2024 at 04:22:39PM -0700, Anthony Yznaga wrote: > From: Khalid Aziz > > Add support for handling page faults in an mshare region by > redirecting the faults to operate on the mshare mm_struct and > vmas contained in it and to link the page tables of the faulting > process with the shared page tables in the mshare mm. > Modify the unmap path to ensure that page tables in mshare regions > are kept intact when a process exits. Note that they are also > kept intact and not unlinked from a process when an mshare region > is explicitly unmapped which is bug to be addressed. > > Signed-off-by: Khalid Aziz > Signed-off-by: Matthew Wilcox (Oracle) > Signed-off-by: Anthony Yznaga > --- > mm/internal.h | 1 + > mm/memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++--- > mm/mshare.c | 38 +++++++++++++++++++++++++++++++ > 3 files changed, 98 insertions(+), 3 deletions(-) > > diff --git a/mm/internal.h b/mm/internal.h > index 8005d5956b6e..8ac224d96806 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -1578,6 +1578,7 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *); > void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *); > void unlink_file_vma_batch_final(struct unlink_vma_file_batch *); > > +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp); > static inline bool vma_is_shared(const struct vm_area_struct *vma) > { > return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT); > diff --git a/mm/memory.c b/mm/memory.c > index 3c01d68065be..f526aef71a61 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -387,11 +387,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > vma_start_write(vma); > unlink_anon_vmas(vma); > > + /* > + * There is no page table to be freed for vmas that > + * are mapped in mshare regions > + */ > if (is_vm_hugetlb_page(vma)) { > unlink_file_vma(vma); > hugetlb_free_pgd_range(tlb, addr, vma->vm_end, > floor, next ? next->vm_start : ceiling); > - } else { > + } else if (!vma_is_shared(vma)) { > unlink_file_vma_batch_init(&vb); > unlink_file_vma_batch_add(&vb, vma); > > @@ -399,7 +403,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > * Optimization: gather nearby vmas into one call down > */ > while (next && next->vm_start <= vma->vm_end + PMD_SIZE > - && !is_vm_hugetlb_page(next)) { > + && !is_vm_hugetlb_page(next) > + && !vma_is_shared(next)) { > vma = next; > next = mas_find(mas, ceiling - 1); > if (unlikely(xa_is_zero(next))) > @@ -412,7 +417,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, > unlink_file_vma_batch_final(&vb); > free_pgd_range(tlb, addr, vma->vm_end, > floor, next ? next->vm_start : ceiling); > - } > + } else > + unlink_file_vma(vma); > + > vma = next; I would rather have vma->vm_ops->free_pgtables() hook that would be defined to non-NULL for mshared and hugetlb VMAs > } while (vma); > } > @@ -1797,6 +1804,13 @@ void unmap_page_range(struct mmu_gather *tlb, > pgd_t *pgd; > unsigned long next; > > + /* > + * No need to unmap vmas that share page table through > + * mshare region > + */ > + if (vma_is_shared(vma)) > + return; > + Ditto. vma->vm_ops->unmap_page_range(). > BUG_ON(addr >= end); > tlb_start_vma(tlb, vma); > pgd = pgd_offset(vma->vm_mm, addr); > @@ -5801,6 +5815,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > struct mm_struct *mm = vma->vm_mm; > vm_fault_t ret; > bool is_droppable; > + bool shared = false; > > __set_current_state(TASK_RUNNING); > > @@ -5808,6 +5823,21 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > if (ret) > goto out; > > + if (unlikely(vma_is_shared(vma))) { > + /* XXX mshare does not support per-VMA locks yet so fallback to mm lock */ > + if (flags & FAULT_FLAG_VMA_LOCK) { > + vma_end_read(vma); > + return VM_FAULT_RETRY; > + } > + > + ret = find_shared_vma(&vma, &address); > + if (ret) > + return ret; > + if (!vma) > + return VM_FAULT_SIGSEGV; > + shared = true; Do we need to update 'mm' variable here? It is going to be used to account the fault below. Not sure which mm has to account such faults. > + } > + > if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, > flags & FAULT_FLAG_INSTRUCTION, > flags & FAULT_FLAG_REMOTE)) { > @@ -5843,6 +5873,32 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > if (is_droppable) > ret &= ~VM_FAULT_OOM; > > + /* > + * Release the read lock on the shared mm of a shared VMA unless > + * unless the lock has already been released. > + * The mmap lock will already have been released if __handle_mm_fault > + * returns VM_FAULT_COMPLETED or if it returns VM_FAULT_RETRY and > + * the flags FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT are > + * _not_ both set. > + * If the lock was released earlier, release the lock on the task's > + * mm now to keep lock state consistent. > + */ > + if (shared) { > + int release_mmlock = 1; > + > + if ((ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) == 0) { > + mmap_read_unlock(vma->vm_mm); > + release_mmlock = 0; > + } else if ((flags & FAULT_FLAG_ALLOW_RETRY) && > + (flags & FAULT_FLAG_RETRY_NOWAIT)) { > + mmap_read_unlock(vma->vm_mm); > + release_mmlock = 0; > + } > + > + if (release_mmlock) > + mmap_read_unlock(mm); > + } > + > if (flags & FAULT_FLAG_USER) { > mem_cgroup_exit_user_fault(); > /* > diff --git a/mm/mshare.c b/mm/mshare.c > index f3f6ed9c3761..8f47c8d6e6a4 100644 > --- a/mm/mshare.c > +++ b/mm/mshare.c > @@ -19,6 +19,7 @@ > #include > #include > #include > +#include "internal.h" > > struct mshare_data { > struct mm_struct *mm; > @@ -33,6 +34,43 @@ struct msharefs_info { > static const struct inode_operations msharefs_dir_inode_ops; > static const struct inode_operations msharefs_file_inode_ops; > > +/* Returns holding the host mm's lock for read. Caller must release. */ > +vm_fault_t > +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp) > +{ > + struct vm_area_struct *vma, *guest = *vmap; > + struct mshare_data *m_data = guest->vm_private_data; > + struct mm_struct *host_mm = m_data->mm; > + unsigned long host_addr; > + pgd_t *pgd, *guest_pgd; > + > + mmap_read_lock(host_mm); Hm. So we have current->mm locked here, right? So this is nested mmap lock. Have you tested it under lockdep? I expected it to complain. > + host_addr = *addrp - guest->vm_start + host_mm->mmap_base; > + pgd = pgd_offset(host_mm, host_addr); > + guest_pgd = pgd_offset(guest->vm_mm, *addrp); > + if (!pgd_same(*guest_pgd, *pgd)) { > + set_pgd(guest_pgd, *pgd); > + mmap_read_unlock(host_mm); > + return VM_FAULT_NOPAGE; > + } > + > + *addrp = host_addr; > + vma = find_vma(host_mm, host_addr); > + > + /* XXX: expand stack? */ > + if (vma && vma->vm_start > host_addr) > + vma = NULL; > + > + *vmap = vma; > + > + /* > + * release host mm lock unless a matching vma is found > + */ > + if (!vma) > + mmap_read_unlock(host_mm); > + return 0; > +} > + > /* > * Disallow partial unmaps of an mshare region for now. Unmapping at > * boundaries aligned to the level page tables are shared at could > -- > 2.43.5 > -- Kiryl Shutsemau / Kirill A. Shutemov