From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8BA00CD5BC5 for ; Thu, 5 Sep 2024 13:14:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E28318D000C; Thu, 5 Sep 2024 09:14:21 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DD6C78D000A; Thu, 5 Sep 2024 09:14:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C9EC88D000C; Thu, 5 Sep 2024 09:14:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id A8FC78D000A for ; Thu, 5 Sep 2024 09:14:21 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 4E42780288 for ; Thu, 5 Sep 2024 13:14:21 +0000 (UTC) X-FDA: 82530728322.22.CC5458C Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf24.hostedemail.com (Postfix) with ESMTP id DF4B1180031 for ; Thu, 5 Sep 2024 13:14:18 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf24.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1725541963; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=icMkp1+njC2UQzIEvwKvnFHcqil0x4TbLB+nX3edvYE=; b=2WEVVMcsnTSu6slbYYO5UREonRCvvq2LMWPFZiHINzXkRXnxs0lC6PMxp2XI+bbQU/d93J 0zxip08iLbeKZY4wc8xXhh9NUNYZ5hcs5FIilYTmeDCEdNehu/YIKg1gk7XxGzY9qa7Iqv zRPGnCVlSpOAfIZjYGot7l0w22LejAc= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1725541963; a=rsa-sha256; cv=none; b=YU/7HU5wW9DAvcoU+NWPbV/PzJGDTKErUIeL2hehMyktI6jY/SdgdJl9NYZ5lHj1BcwB60 ZcmeIkcOyvny2kIkn09juM+ZPc6xbsY2kHoPIdXYMM3e6SI5963n9LijwD9qBuQDkaF4YW ChgH2Y4IYJ6Co4SC1BCk9LvxPE8OE5A= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf24.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6868AFEC; Thu, 5 Sep 2024 06:14:44 -0700 (PDT) Received: from [10.1.36.183] (XHFQ2J9959.cambridge.arm.com [10.1.36.183]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DDE693F73F; Thu, 5 Sep 2024 06:14:14 -0700 (PDT) Message-ID: <542b8267-39de-4593-82ef-a586bb372824@arm.com> Date: Thu, 5 Sep 2024 14:14:13 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 2/2] mm: Allocate THP on hugezeropage wp-fault Content-Language: en-GB To: Dev Jain , akpm@linux-foundation.org, david@redhat.com, willy@infradead.org, kirill.shutemov@linux.intel.com Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, mark.rutland@arm.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, jglisse@google.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20240904100923.290042-1-dev.jain@arm.com> <20240904100923.290042-3-dev.jain@arm.com> From: Ryan Roberts In-Reply-To: <20240904100923.290042-3-dev.jain@arm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: DF4B1180031 X-Stat-Signature: qmuwz4i97rdpc196hxr7hbjqjwtz7x5i X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1725542058-394602 X-HE-Meta: U2FsdGVkX18UVrAhbI6/Yn3lVBuQbAZESL4DfMzz0QDmrQjxQQseN3HHA9ibMHaw0/ouW77Wpmtn2IvN+O6B+0WY9NZGFQlqUOOZSOoy1teWazR6F+y/4HWhdZsBcKO6L2Xc35r3ausWfBO9xaqPvoax4HtfDTdXnaYF6PVheC6Ts8U1wCfdnYzqEC7CNZSALqfYHNl82uPxN3/zNorpuDWrrWiBMVyLhKMLGESM7SY71xGo3+tCWVuieCbCU/Rz2mHlT2X0ef5Ak0OHeAzZAjGeET6AmGlpPdox/4MWSQBaXK9YqS6R8vN1CZdDDNYgZ6pfMEhjRcBmyHcjfIWoYXXJkVCYi8LJgtD2Me/ZhXqR/IjtWkmG3lbIFwJa/p6Kguk8Y1E2vt+Fipa2fTXkKo45+xoRWJxx2JEH+/AGnbc9TuEFlRA/Xrh5b6HlVEvL2hdrJBpvDeKv1rnqzOdxMrqAnMK8+28zigL1u0ubpHSzbQPeHcwLhLfLogCo/b7u4yVaYLUuL4zyKt54316PBPVaGIQ/y5qjgF55X1xHiyt84rYCu7BmDCOzxHtbCT7F937Dr4wvZInr0QF+4474tGhcx9ToZOooJSNg9atL00Avbotx955vKTjGaM05FAEBZx+j96Amm88QSzdTfSLKJmefTcustwpVPQMdWTMypWYfMh4IDp8IjNjLYFkX/wUvMUvFjOPF1DtpwU7AUMmLCxXqMbJaJsOkmglBBytPCfPgUAyw+dzBLbf/uo7hINgBbX9C89hDm3wC5AkRkPg4/HRdFfixJyXnJ5Npo6RTXzs9ZEEcnCi+p/WTmrUA6eODzRAmzTmxuzF7PESG7P1zOOE/jD05u9AVT5HsSq7cHA7MqDIKTFXEuxsuIBou0ymz26lIU7U95/3tyAak7N3cm9NSl54Y7KvkI2BfeCylsjwH7AqOGUKvL4nab8jAV4wLg3yrxO+QRmXX99ecOTc HkAvDfCO 3JG9k4BwMuiAUn8ceL2QfgtNF/8AxrvpM0loQho8WO17xNneyeSEtg6VTe0dBoK5a++tMA0V8N78SqyXF2Mkxy8iK48c4hcg0EaXZmNx8jRfA+bDz6pu1Zp47HaZVRoSqwogtGrVTktruiZCVWc9yzUsJ7iSQQzhWYeHUNAUL97Nw3mJiPbPX3wfoJZY9nUwoSm+53Kag2yUE9sQ= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 04/09/2024 11:09, Dev Jain wrote: > Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and > replace it with a PMD-mapped THP. Change the helpers introduced in the > previous patch to flush TLB entry corresponding to the hugezeropage, > and preserve PMD uffd-wp marker. In case of failure, fallback to > splitting the PMD. > > Signed-off-by: Dev Jain > --- > include/linux/huge_mm.h | 6 ++++ > mm/huge_memory.c | 79 +++++++++++++++++++++++++++++++++++------ > mm/memory.c | 5 +-- > 3 files changed, 78 insertions(+), 12 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index e25d9ebfdf89..fdd2cf473a3c 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -9,6 +9,12 @@ > #include > > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); > +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma, > + unsigned long haddr, struct folio **foliop, > + unsigned long addr); > +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, > + struct vm_area_struct *vma, unsigned long haddr, > + pgtable_t pgtable); I don't think you are using either of these outside of huge_memory.c, so not sure you need to declare them here or make them non-static? > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 58125fbcc532..150163ad77d3 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -943,9 +943,9 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > } > EXPORT_SYMBOL_GPL(thp_get_unmapped_area); > > -static vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma, > - unsigned long haddr, struct folio **foliop, > - unsigned long addr) > +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma, > + unsigned long haddr, struct folio **foliop, > + unsigned long addr) > { > struct folio *folio = vma_alloc_folio(gfp, order, vma, haddr, true); > > @@ -984,21 +984,29 @@ static void __thp_fault_success_stats(struct vm_area_struct *vma, int order) > count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); > } > > -static void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, > - struct vm_area_struct *vma, unsigned long haddr, > - pgtable_t pgtable) > +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, > + struct vm_area_struct *vma, unsigned long haddr, > + pgtable_t pgtable) > { > - pmd_t entry; > + pmd_t entry, old_pmd; > + bool is_pmd_none = pmd_none(*vmf->pmd); > > entry = mk_huge_pmd(&folio->page, vma->vm_page_prot); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); > folio_add_lru_vma(folio, vma); > - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); > + if (!is_pmd_none) { > + old_pmd = pmdp_huge_clear_flush(vma, haddr, vmf->pmd); > + if (pmd_uffd_wp(old_pmd)) > + entry = pmd_mkuffd_wp(entry); I don't really get this; entry is writable, so I wouldn't expect to also be setting uffd-wp here? That combination is not allowed and is checked for in page_table_check_pte_flags(). It looks like you expect to get here in the uffd-wp-async case, which used to cause the pmd to be split to ptes. I'm guessing (but don't know for sure) that would cause the uffd-wp bit to be set in each of the new ptes, then during fallback to handling the wp fault on the pte, uffd would handle it? > + } > + if (pgtable) > + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); Should this call be moved outside of here? It doesn't really feel like it belongs. Could it be called before calling map_pmd_thp() for the site that has a pgtable? > set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); > update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); > add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); > - mm_inc_nr_ptes(vma->vm_mm); > + if (is_pmd_none) > + mm_inc_nr_ptes(vma->vm_mm); > } > > static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) > @@ -1576,6 +1584,50 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) > spin_unlock(vmf->ptl); > } > > +static vm_fault_t do_huge_zero_wp_pmd_locked(struct vm_fault *vmf, > + unsigned long haddr, > + struct folio *folio) > +{ > + struct vm_area_struct *vma = vmf->vma; > + vm_fault_t ret = 0; > + > + ret = check_stable_address_space(vma->vm_mm); > + if (ret) > + goto out; > + map_pmd_thp(folio, vmf, vma, haddr, NULL); > +out: > + return ret; > +} > + > +static vm_fault_t do_huge_zero_wp_pmd(struct vm_fault *vmf, unsigned long haddr) > +{ > + struct vm_area_struct *vma = vmf->vma; > + gfp_t gfp = vma_thp_gfp_mask(vma); > + struct mmu_notifier_range range; > + struct folio *folio = NULL; > + vm_fault_t ret = 0; > + > + ret = thp_fault_alloc(gfp, HPAGE_PMD_ORDER, vma, haddr, &folio, > + vmf->address); Just checking: the PTE table was already allocated during the read fault, right? So we don't have to allocate it here. > + if (ret) > + goto out; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, haddr, > + haddr + HPAGE_PMD_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > + if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) > + goto unlock; > + ret = do_huge_zero_wp_pmd_locked(vmf, haddr, folio); > + if (!ret) > + __thp_fault_success_stats(vma, HPAGE_PMD_ORDER); > +unlock: > + spin_unlock(vmf->ptl); > + mmu_notifier_invalidate_range_end(&range); I'll confess I don't understand all the mmu notifier rules. But the doc at Documentation/mm/mmu_notifier.rst implies that the notification must be done while holding the PTL. Although that's not how wp_page_copy(). Are you confident what you have done is correct? Thanks, Ryan > +out: > + return ret; > +} > + > vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > { > const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; > @@ -1588,8 +1640,15 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); > VM_BUG_ON_VMA(!vma->anon_vma, vma); > > - if (is_huge_zero_pmd(orig_pmd)) > + if (is_huge_zero_pmd(orig_pmd)) { > + vm_fault_t ret = do_huge_zero_wp_pmd(vmf, haddr); > + > + if (!(ret & VM_FAULT_FALLBACK)) > + return ret; > + > + /* Fallback to splitting PMD if THP cannot be allocated */ > goto fallback; > + } > > spin_lock(vmf->ptl); > > diff --git a/mm/memory.c b/mm/memory.c > index 3c01d68065be..c081a25f5173 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5409,9 +5409,10 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) > if (vma_is_anonymous(vma)) { > if (likely(!unshare) && > userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) { > - if (userfaultfd_wp_async(vmf->vma)) > + if (!userfaultfd_wp_async(vmf->vma)) > + return handle_userfault(vmf, VM_UFFD_WP); > + if (!is_huge_zero_pmd(vmf->orig_pmd)) > goto split; > - return handle_userfault(vmf, VM_UFFD_WP); > } > return do_huge_pmd_wp_page(vmf); > }