Message-ID: <5993c198-0d27-46c3-b757-3a02c2aacfc9@arm.com>
Date: Mon, 30 Oct 2023 11:43:13 +0000
Subject: Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios
From: Ryan Roberts <ryan.roberts@arm.com>
To: John Hubbard, Andrew Morton, Matthew Wilcox, Yin Fengwei,
 David Hildenbrand, Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi,
 "Huang, Ying", Zi Yan, Luis Chamberlain, Itaru Kitayama,
 "Kirill A. Shutemov", David Rientjes, Vlastimil Babka, Hugh Dickins
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
In-Reply-To: <8a72da61-b2ef-48ad-ae59-0bae7ac2ce10@nvidia.com>
References: <20230929114421.3761121-1-ryan.roberts@arm.com>
 <20230929114421.3761121-6-ryan.roberts@arm.com>
 <8a72da61-b2ef-48ad-ae59-0bae7ac2ce10@nvidia.com>
On 28/10/2023 00:04, John Hubbard wrote:
> On 9/29/23 04:44, Ryan Roberts wrote:
> 
> Hi Ryan,
> 
> A few clarifying questions below.

Excellent - keep them coming!

> 
> ...
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>> 
>>   /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>    */
>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>> 
>>   /*
>>    * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>       return ret;
>>   }
>> 
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +    int i;
>> +
>> +    if (nr_pages == 1)
>> +        return vmf_pte_changed(vmf);
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +            return true;
> 
> This seems like something different than the function name implies.
> It's really confusing: for a single page case, return true if the
> pte in the page tables has changed; yes, that is very clear.
> 
> But then for multiple page cases, which is really the main
> focus here--for that, claim that the range has changed if any
> pte is present (!pte_none).
> Can you please help me understand
> what this means?

Yes, I understand your confusion. Although I'm confident that the code is
correct, it's a bad name - I'll make the excuse that this has evolved through
rebasing to cope with additions to UFFD. Perhaps something like
vmf_is_large_folio_suitable() is a better name.

It used to be that we would only take the do_anonymous_page() path if the pte
was none; i.e. this is the first time we are faulting on an address covered by
an anon VMA and we need to allocate some memory. But more recently we also end
up here if the pte is a uffd_wp marker. So for a single pte, instead of
checking none, we can check if the pte has changed from our original check
(where we determined it was a uffd_wp marker or none). But for multiple ptes,
we don't have storage to hold all the original ptes from the first check.

Fortunately, if uffd is in use for a vma then we don't want to use a large
folio anyway (that would break uffd semantics, because we would no longer get
a fault for every page). So we only care about the "same but not none" case
for nr_pages=1.

Would changing the name to vmf_is_large_folio_suitable() help here?

> 
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +    gfp_t gfp;
>> +    pte_t *pte;
>> +    unsigned long addr;
>> +    struct folio *folio;
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned int orders;
>> +    int order;
>> +
>> +    /*
>> +     * If uffd is active for the vma we need per-page fault fidelity to
>> +     * maintain the uffd semantics.
>> +     */
>> +    if (userfaultfd_armed(vma))
>> +        goto fallback;
>> +
>> +    /*
>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +     * for this vma. Then filter out the orders that can't be allocated over
>> +     * the faulting address and still be fully contained in the vma.
>> +     */
>> +    orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +                    BIT(PMD_ORDER) - 1);
>> +    orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +    if (!orders)
>> +        goto fallback;
>> +
>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +    if (!pte)
>> +        return ERR_PTR(-EAGAIN);
> 
> pte_offset_map() can only fail due to:
> 
>     a) Wrong pmd type. These include:
>         pmd_none
>         pmd_bad
>         pmd migration entry
>         pmd_trans_huge
>         pmd_devmap
> 
>     b) __pte_map() failure
> 
> For (a), why is it that -EAGAIN is used here? I see that that
> will lead to a re-fault, I got that far, but am missing something
> still.
> 
> For (b), same question, actually. I'm not completely sure why
> a retry is going to fix a __pte_map() failure?

I'm not going to claim to understand all the details of this. But this is due
to a change that Hugh introduced, and we concluded at [1] that it's always
correct to return EAGAIN here to rerun the fault. In fact, with the current
implementation, pte_offset_map() should never fail for anon IIUC, but the view
was that EAGAIN makes it safe for tomorrow, and because this would only fail
due to a race, retrying is correct.
[1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/

> 
>> +
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        vmf->pte = pte + pte_index(addr);
>> +        if (!vmf_pte_range_changed(vmf, 1 << order))
>> +            break;
>> +        order = next_order(&orders, order);
>> +    }
>> +
>> +    vmf->pte = NULL;
>> +    pte_unmap(pte);
>> +
>> +    gfp = vma_thp_gfp_mask(vma);
>> +
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +        if (folio) {
>> +            clear_huge_page(&folio->page, addr, 1 << order);
>> +            return folio;
>> +        }
>> +        order = next_order(&orders, order);
>> +    }
> 
> And finally: is it accurate to say that there are *no* special
> page flags being set, for PTE-mapped THPs? I don't see any here,
> but want to confirm.

The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in
the correct flags based on the transparent_hugepage/defrag file.

> 
> 
> thanks,