From: Dev Jain <dev.jain@arm.com>
Date: Fri, 6 Sep 2024 14:30:07 +0530
Subject: Re: [PATCH v2 2/2] mm: Allocate THP on hugezeropage wp-fault
Message-ID: <709c0648-c9c5-4197-83f2-64d36293b99e@arm.com>
To: Ryan Roberts, akpm@linux-foundation.org, david@redhat.com, willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, mark.rutland@arm.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, jglisse@google.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240904100923.290042-1-dev.jain@arm.com> <20240904100923.290042-3-dev.jain@arm.com> <542b8267-39de-4593-82ef-a586bb372824@arm.com> <7b7c5dc8-933b-405f-be27-907624f7f8ce@arm.com>

On 9/6/24 14:13, Ryan Roberts wrote:
> On 06/09/2024 08:05, Dev Jain wrote:
>> On 9/5/24 18:44, Ryan Roberts wrote:
>>> On 04/09/2024 11:09, Dev Jain wrote:
>>>> Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and
>>>> replace it with a PMD-mapped THP.
>>>> Change the helpers introduced in the
>>>> previous patch to flush the TLB entry corresponding to the hugezeropage,
>>>> and preserve the PMD uffd-wp marker. In case of failure, fall back to
>>>> splitting the PMD.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>   include/linux/huge_mm.h |  6 ++++
>>>>   mm/huge_memory.c        | 79 +++++++++++++++++++++++++++++++++++------
>>>>   mm/memory.c             |  5 +--
>>>>   3 files changed, 78 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index e25d9ebfdf89..fdd2cf473a3c 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -9,6 +9,12 @@
>>>>   #include
>>>>
>>>>   vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>>>> +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>> +               unsigned long haddr, struct folio **foliop,
>>>> +               unsigned long addr);
>>>> +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf,
>>>> +         struct vm_area_struct *vma, unsigned long haddr,
>>>> +         pgtable_t pgtable);
>>> I don't think you are using either of these outside of huge_memory.c, so not
>>> sure you need to declare them here or make them non-static?
>> As pointed out by Kirill, you are right, I forgot to drop these from my
>> previous approach.
>>
>>>>   int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>             pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>>>>             struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 58125fbcc532..150163ad77d3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -943,9 +943,9 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
>>>>
>>>> -static vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>> -                  unsigned long haddr, struct folio **foliop,
>>>> -                  unsigned long addr)
>>>> +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma,
>>>> +               unsigned long haddr, struct folio **foliop,
>>>> +               unsigned long addr)
>>>>   {
>>>>       struct folio *folio = vma_alloc_folio(gfp, order, vma, haddr, true);
>>>>
>>>> @@ -984,21 +984,29 @@ static void __thp_fault_success_stats(struct vm_area_struct *vma, int order)
>>>>       count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
>>>>   }
>>>>
>>>> -static void map_pmd_thp(struct folio *folio, struct vm_fault *vmf,
>>>> -            struct vm_area_struct *vma, unsigned long haddr,
>>>> -            pgtable_t pgtable)
>>>> +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf,
>>>> +         struct vm_area_struct *vma, unsigned long haddr,
>>>> +         pgtable_t pgtable)
>>>>   {
>>>> -    pmd_t entry;
>>>> +    pmd_t entry, old_pmd;
>>>> +    bool is_pmd_none = pmd_none(*vmf->pmd);
>>>>
>>>>       entry = mk_huge_pmd(&folio->page, vma->vm_page_prot);
>>>>       entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>>>>       folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
>>>>       folio_add_lru_vma(folio, vma);
>>>> -    pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
>>>> +    if (!is_pmd_none) {
>>>> +        old_pmd = pmdp_huge_clear_flush(vma, haddr, vmf->pmd);
>>>> +        if (pmd_uffd_wp(old_pmd))
>>>> +            entry = pmd_mkuffd_wp(entry);
>>> I don't really get this; entry is writable, so I wouldn't expect to also be
>>> setting uffd-wp here? That combination is not allowed and is checked for in
>>> page_table_check_pte_flags().
>>>
>>> It looks like you expect to get here in the uffd-wp-async case, which used to
>>> cause the pmd to be split to ptes. I'm guessing (but don't know for sure) that
>>> would cause the uffd-wp bit to be set in each of the new ptes, then during
>>> fallback to handling the wp fault on the pte, uffd would handle it?
>> I guess you are correct; I missed the WARN_ON() in page_table_check_pmd_flags(),
>> but I did see, if I read the uffd code correctly, that mfill_atomic() will just
>> return in case of pmd_trans_huge(*dst_pmd) while doing a uffd_copy to the
>> destination location. So preserving pmd_uffd_wp is useless in case a THP is
>> mapped, but I did not know that it is in fact supposed to be an invalid
>> combination. So I will drop it, unless someone has some other objection.
> So what's the correct way to handle uffd-wp-async in wp_huge_pmd()? Just split
> it? If so, you can revert your changes in memory.c.

I think so.

>
>>>> +    }
>>>> +    if (pgtable)
>>>> +        pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
>>> Should this call be moved outside of here? It doesn't really feel like it
>>> belongs. Could it be called before calling map_pmd_thp() for the site that
>>> has a pgtable?
>> Every other place I checked, they are doing this: make the entry -> deposit
>> pgtable -> set_pmd_at(). I guess the general flow is to do the deposit based
>> on the old pmd, before setting the new one. Which implies: I should at least
>> move this check before I call
And, since vmf->pmd and creating the new entry has no >> relation, >> I am inclined to do what you are saying. > pgtable_trans_huge_deposit() is just adding the pgtable to a list so that if the > THP needs to be split in future, then we have preallocated the pte pgtable so > the operation can't fail. Yes. > And enqueing it is just under the protection of the > PTL as far as I can tell. So I think the only ordering requirement is that you > both set the pmd and deposit the pgtable under the lock (without dropping it in > between). So you can deposit either before or after map_pmd_thp(). Yes I'll do that before. > And > pmdp_huge_clear_flush() is irrelavent, I think? You mean, in this context? Everywhere, pgtable deposit uses the old pmd value to be replaced as its input, that is, it is called before set_pmd_at(). So calling pgtable deposit after clear_flush() will violate this ordering. I do not think this ordering is really required but I'd rather be safe :) > >>>>       set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); >>>>       update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); >>>>       add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); >>>> -    mm_inc_nr_ptes(vma->vm_mm); >>>> +    if (is_pmd_none) >>>> +        mm_inc_nr_ptes(vma->vm_mm); >>>>   } >>>>     static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) >>>> @@ -1576,6 +1584,50 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) >>>>       spin_unlock(vmf->ptl); >>>>   } >>>>   +static vm_fault_t do_huge_zero_wp_pmd_locked(struct vm_fault *vmf, >>>> +                         unsigned long haddr, >>>> +                         struct folio *folio) >>>> +{ >>>> +    struct vm_area_struct *vma = vmf->vma; >>>> +    vm_fault_t ret = 0; >>>> + >>>> +    ret = check_stable_address_space(vma->vm_mm); >>>> +    if (ret) >>>> +        goto out; >>>> +    map_pmd_thp(folio, vmf, vma, haddr, NULL); >>>> +out: >>>> +    return ret; >>>> +} >>>> + >>>> +static vm_fault_t 
do_huge_zero_wp_pmd(struct vm_fault *vmf, unsigned long >>>> haddr) >>>> +{ >>>> +    struct vm_area_struct *vma = vmf->vma; >>>> +    gfp_t gfp = vma_thp_gfp_mask(vma); >>>> +    struct mmu_notifier_range range; >>>> +    struct folio *folio = NULL; >>>> +    vm_fault_t ret = 0; >>>> + >>>> +    ret = thp_fault_alloc(gfp, HPAGE_PMD_ORDER, vma, haddr, &folio, >>>> +                  vmf->address); >>> Just checking: the PTE table was already allocated during the read fault, right? >>> So we don't have to allocate it here. >> Correct, that happens in set_huge_zero_folio(). Thanks for checking. >> >>>> +    if (ret) >>>> +        goto out; >>>> + >>>> +    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, haddr, >>>> +                haddr + HPAGE_PMD_SIZE); >>>> +    mmu_notifier_invalidate_range_start(&range); >>>> +    vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); >>>> +    if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) >>>> +        goto unlock; >>>> +    ret = do_huge_zero_wp_pmd_locked(vmf, haddr, folio); >>>> +    if (!ret) >>>> +        __thp_fault_success_stats(vma, HPAGE_PMD_ORDER); >>>> +unlock: >>>> +    spin_unlock(vmf->ptl); >>>> +    mmu_notifier_invalidate_range_end(&range); >>> I'll confess I don't understand all the mmu notifier rules. >> I confess the same :) >> >>> But the doc at >>> Documentation/mm/mmu_notifier.rst implies that the notification must be done >>> while holding the PTL. Although that's not how wp_page_copy(). Are you confident >>> what you have done is correct? >> Everywhere else, invalidate_range_end() is getting called after dropping the lock, >> one reason is that it has a might_sleep(), and therefore we cannot call it while >> holding a spinlock. I still don't know what exactly these calls mean...but I think >> what I have done is correct. > High level; they are notifying secondary MMUs (e.g. IOMMU) of a change so the > tables of those secondary MMUs can be kept in sync. 
> I don't understand all the ordering requirement details though.
>
> I think what you have is probably correct, would be good for someone who knows
> what they are talking about to confirm though :)

Exactly.

>
>>> Thanks,
>>> Ryan
>>>
>>>> +out:
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>>>   {
>>>>       const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>>>> @@ -1588,8 +1640,15 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>>>       vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>>>>       VM_BUG_ON_VMA(!vma->anon_vma, vma);
>>>>
>>>> -    if (is_huge_zero_pmd(orig_pmd))
>>>> +    if (is_huge_zero_pmd(orig_pmd)) {
>>>> +        vm_fault_t ret = do_huge_zero_wp_pmd(vmf, haddr);
>>>> +
>>>> +        if (!(ret & VM_FAULT_FALLBACK))
>>>> +            return ret;
>>>> +
>>>> +        /* Fallback to splitting PMD if THP cannot be allocated */
>>>>           goto fallback;
>>>> +    }
>>>>
>>>>       spin_lock(vmf->ptl);
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3c01d68065be..c081a25f5173 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -5409,9 +5409,10 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>>>       if (vma_is_anonymous(vma)) {
>>>>           if (likely(!unshare) &&
>>>>               userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) {
>>>> -            if (userfaultfd_wp_async(vmf->vma))
>>>> +            if (!userfaultfd_wp_async(vmf->vma))
>>>> +                return handle_userfault(vmf, VM_UFFD_WP);
>>>> +            if (!is_huge_zero_pmd(vmf->orig_pmd))
>>>>                   goto split;
>>>> -            return handle_userfault(vmf, VM_UFFD_WP);
>>>>           }
>>>>           return do_huge_pmd_wp_page(vmf);
>>>>       }