From: Xu Yu <xuyu@linux.alibaba.com>
To: David Hildenbrand, linux-mm@kvack.org
Subject: Re: [PATCH v2 1/2] mm/khugepaged: attempt to map anonymous pte-mapped THPs by pmds
Date: Mon, 18 Dec 2023 10:45:45 +0800
In-Reply-To: <9032be4a-6168-41d6-8ad2-494da4e9fd6b@redhat.com>
References: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>
 <9032be4a-6168-41d6-8ad2-494da4e9fd6b@redhat.com>

On 12/7/23 6:37 PM, David Hildenbrand wrote:
> On 07.12.23 04:09, Xu Yu wrote:
>> In the anonymous collapse path, khugepaged always collapses a
>> pte-mapped hugepage by allocating and copying into a new hugepage.
>>
>> In some scenarios, we need only update the mapping page tables for
>> anonymous pte-mapped THPs, in the same way as for file/shmem-backed
>> pte-mapped THPs, as shown in commit 58ac9a8993a1 ("mm/khugepaged:
>> attempt to map file/shmem-backed pte-mapped THPs by pmds").
>>
>> The simplest scenario that satisfies the conditions, as David points
>> out, is when no subpages are PageAnonExclusive (the PTEs must be R/O);
>> then we can collapse into a R/O PMD without further action.
>>
>> Let's start from this simplest scenario.
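
(As a side note, for anyone trying to reproduce this layout: below is a
purely illustrative userspace sketch, not part of the patch and with
error handling omitted, of one way an anonymous THP can end up
pte-mapped with only R/O, non-exclusive PTEs: split the huge PMD, e.g.
via a partial mprotect(), and then fork(). Whether khugepaged actually
collapses it afterwards still depends on the usual THP sysfs settings
and VMA eligibility.)

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ_2M	(2UL << 20)

int main(void)
{
	/* 2M-aligned anonymous memory; assumes THP is enabled. */
	char *p = aligned_alloc(SZ_2M, SZ_2M);

	madvise(p, SZ_2M, MADV_HUGEPAGE);
	memset(p, 1, SZ_2M);	/* fault in, hopefully as a PMD-mapped THP */

	/* A partial mprotect() splits the huge PMD, pte-mapping the THP. */
	mprotect(p, 4096, PROT_READ);
	mprotect(p, 4096, PROT_READ | PROT_WRITE);

	/*
	 * fork() write-protects the anonymous PTEs for COW and clears
	 * PageAnonExclusive, so no subpage is exclusive and all PTEs
	 * are R/O -- the "Case 1" layout described above.
	 */
	if (fork() == 0)
		pause();
	pause();
	return 0;
}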
>>
>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
>> ---
>>  mm/khugepaged.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 214 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 88433cc25d8a..85c7a2ab44ce 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1237,6 +1237,197 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>       return result;
>>   }
>> +static struct page *find_lock_pte_mapped_page(struct vm_area_struct *vma,
>> +                          unsigned long addr, pmd_t *pmd)
>> +{
>> +    pte_t *pte, pteval;
>> +    struct page *page = NULL;
>> +
>> +    pte = pte_offset_map(pmd, addr);
>> +    if (!pte)
>> +        return NULL;
>> +
>> +    pteval = ptep_get_lockless(pte);
>> +    if (pte_none(pteval) || !pte_present(pteval))
>> +        goto out;
>> +
>> +    page = vm_normal_page(vma, addr, pteval);
>> +    if (unlikely(!page) || unlikely(is_zone_device_page(page)))
>> +        goto out;
>> +
>> +    page = compound_head(page);
>> +
>> +    if (!trylock_page(page)) {
>> +        page = NULL;
>> +        goto out;
>> +    }
>> +
>> +    if (!get_page_unless_zero(page)) {
>> +        unlock_page(page);
>> +        page = NULL;
>> +        goto out;
>> +    }
>> +
>> +out:
>> +    pte_unmap(pte);
>> +    return page;
>> +}
>> +
>> +static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
>> +                struct vm_area_struct *vma,
>> +                unsigned long haddr, bool *mmap_locked,
>> +                struct collapse_control *cc)
>> +{
>> +    struct mmu_notifier_range range;
>> +    struct page *hpage;
>> +    pte_t *start_pte, *pte;
>> +    pmd_t *pmd, pmdval;
>> +    spinlock_t *pml, *ptl;
>> +    pgtable_t pgtable;
>> +    unsigned long addr;
>> +    int exclusive = 0;
>> +    bool writable = false;
>> +    int result, i;
>> +
>> +    /* Fast check before locking page if already PMD-mapped */
>> +    result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>> +    if (result == SCAN_PMD_MAPPED)
>> +        return result;
>> +
>> +    hpage = find_lock_pte_mapped_page(vma, haddr, pmd);
>> +    if (!hpage)
>> +        return SCAN_PAGE_NULL;
>> +    if (!PageHead(hpage)) {
>> +        result = SCAN_FAIL;
>> +        goto drop_hpage;
>> +    }
>> +    if (compound_order(hpage) != HPAGE_PMD_ORDER) {
>> +        result = SCAN_PAGE_COMPOUND;
>> +        goto drop_hpage;
>> +    }
>> +
>> +    mmap_read_unlock(mm);
>> +    *mmap_locked = false;
>> +
>> +    /* Prevent all access to pagetables */
>> +    mmap_write_lock(mm);
>> +
>> +    result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
>> +    if (result != SCAN_SUCCEED)
>> +        goto up_write;
>> +
>> +    result = check_pmd_still_valid(mm, haddr, pmd);
>> +    if (result != SCAN_SUCCEED)
>> +        goto up_write;
>> +
>> +    /* Recheck with mmap write lock */
>> +    result = SCAN_SUCCEED;
>> +    start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
>> +    if (!start_pte)
>> +        goto drop_hpage;
>> +    for (i = 0, addr = haddr, pte = start_pte;
>> +         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> +        struct page *page;
>> +        pte_t pteval = ptep_get(pte);
>> +
>> +        if (pte_none(pteval) || !pte_present(pteval)) {
>> +            result = SCAN_PTE_NON_PRESENT;
>> +            break;
>> +        }
>> +
>> +        if (pte_uffd_wp(pteval)) {
>> +            result = SCAN_PTE_UFFD_WP;
>> +            break;
>> +        }
>> +
>> +        if (pte_write(pteval))
>> +            writable = true;
>> +
>> +        page = vm_normal_page(vma, addr, pteval);
>> +        if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
>> +            result = SCAN_PAGE_NULL;
>> +            break;
>> +        }
>> +
>> +        if (hpage + i != page) {
>> +            result = SCAN_FAIL;
>> +            break;
>> +        }
>> +
>> +        if (PageAnonExclusive(page))
>> +            exclusive++;
>> +    }
>> +    pte_unmap_unlock(start_pte, ptl);
>> +    if (result != SCAN_SUCCEED)
>> +        goto drop_hpage;
>> +
>
> I'm wondering if anybody can now modify the page table. Likely you also
> need the VMA write lock; the mmap write lock alone is no longer
> sufficient.
>
>> +    /*
>> +     * Case 1:
>> +     * No subpages are PageAnonExclusive (PTEs must be R/O), we can
>> +     * collapse into a R/O PMD without further action.
>> +     */
>> +    if (!(exclusive == 0 && !writable))
>> +        goto drop_hpage;
>> +
>> +    /* Collapse pmd entry */
>> +    vma_start_write(vma);

Will move vma_start_write() to the front, just after taking the mmap
write lock.

>> +    anon_vma_lock_write(vma->anon_vma);
>> +
>> +    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
>> +                haddr, haddr + HPAGE_PMD_SIZE);
>> +    mmu_notifier_invalidate_range_start(&range);
>> +
>> +    pml = pmd_lock(mm, pmd); /* probably unnecessary */
>> +    pmdval = pmdp_collapse_flush(vma, haddr, pmd);
>> +    spin_unlock(pml);
>> +    mmu_notifier_invalidate_range_end(&range);
>> +    tlb_remove_table_sync_one();
>> +
>> +    anon_vma_unlock_write(vma->anon_vma);
>> +
>> +    start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
>> +    if (!start_pte)
>> +        goto rollback;
>
> How can that happen?

I referred to the abort logic in collapse_pte_mapped_thp(). I am not
sure whether anyone can modify the page table at the same time, so I
added a redundant check here. If the mmap write lock + VMA write lock +
page lock are sufficient, I should delete this check.

>
>> +    for (i = 0, addr = haddr, pte = start_pte;
>> +         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> +        struct page *page;
>> +        pte_t pteval = ptep_get(pte);
>> +
>> +        page = vm_normal_page(vma, addr, pteval);
>> +        page_remove_rmap(page, vma, false);
>> +    }
>> +    pte_unmap_unlock(start_pte, ptl);
>> +
>> +    /* Install pmd entry */
>> +    pgtable = pmd_pgtable(pmdval);
>> +    pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);
>> +    spin_lock(pml);
>> +    page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
>> +    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +    set_pmd_at(mm, haddr, pmd, pmdval);
>> +    update_mmu_cache_pmd(vma, haddr, pmd);
>> +    spin_unlock(pml);
>
> 1) It looks like you are missing the refcount update.
>
> You should take one reference before adding the new rmap, and drop the
> other references when dropping the rmaps.
>
> 2) You should drop the old rmaps after obtaining the new rmap
> (+reference).
>
> 3) You should use folios in the new code.

Many thanks! Will do in v3.

>
> 4) You'll soon be able to use rmap batching [1] and avoid looping over
>    the pages once more manually.
>
> Based on [1], the code flow likely should look something like this:

Thanks again! I notice that [1] is in the patchset ("mm/rmap: interface
overhaul"), which is still under review. Should I wait for that patchset
to be merged into mm-unstable or mm-stable, or can I send out v3 without
this helper first? If you don't mind, I will send out v3 first.
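
For a v3 without the batching helper, the ordering I have in mind is
roughly the following. This is only an untested sketch against the
current rmap API; page_folio()/folio_get()/folio_ref_sub() are my
additions here, and the pte map/unmap, locking, and error paths are
elided:

	struct folio *folio = page_folio(hpage);

	/* Install pmd entry */
	pgtable = pmd_pgtable(pmdval);
	pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);

	folio_get(folio);	/* reference held by the new PMD mapping */
	spin_lock(pml);
	page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
	pgtable_trans_huge_deposit(mm, pmd, pgtable);
	set_pmd_at(mm, haddr, pmd, pmdval);
	update_mmu_cache_pmd(vma, haddr, pmd);
	spin_unlock(pml);

	/* Only now drop the per-PTE rmaps and their references. */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		page_remove_rmap(hpage + i, vma, false);
	folio_ref_sub(folio, HPAGE_PMD_NR);

If that ordering looks right, I will switch it over to the
folio_add_anon_rmap_pmd()/folio_remove_rmap_ptes() helpers once [1]
lands.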
>
> first_page = &folio->page;
> ...
> folio_get(folio);
> folio_add_anon_rmap_pmd(folio, first_page, RMAP_NONE);
> ...
> folio_remove_rmap_ptes(folio, first_page, HPAGE_PMD_NR, vma);
> folio_ref_sub(folio, HPAGE_PMD_NR);
> ...
>
> [1] https://lkml.kernel.org/r/20231204142146.91437-24-david@redhat.com
>

--
Thanks,
Yu