Subject: Re: [PATCH v2 1/2] mm/khugepaged: attempt to map anonymous pte-mapped THPs by pmds
From: Xu Yu <xuyu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: david@redhat.com
Date: Thu, 7 Dec 2023 15:47:47 +0800
Message-ID: <17f31a27-dcb7-4907-bfa0-6282d202b7ac@linux.alibaba.com>
References: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>
In-Reply-To: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>

On 12/7/23 11:09 AM, Xu Yu wrote:
> In the anonymous collapse path, khugepaged always collapses a pte-mapped
> hugepage by allocating a new hugepage and copying into it.
>
> In some scenarios, we only need to update the mapping page tables for
> anonymous pte-mapped THPs, in the same way as file/shmem-backed
> pte-mapped THPs, as shown in commit 58ac9a8993a1 ("mm/khugepaged:
> attempt to map file/shmem-backed pte-mapped THPs by pmds").
>
> The simplest scenario that satisfies the conditions, as David points out,
> is when no subpages are PageAnonExclusive (the PTEs must be R/O); then we
> can collapse into a R/O PMD without further action.
>
> Let's start from this simplest scenario.
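
(Not part of the patch, just for context: below is a rough, untested
user-space sketch of how such a pte-mapped anonymous THP, with all subpages
R/O and non-exclusive, could arise. The constants and the exact sequence are
my own illustration, error handling is omitted, and khugepaged tunables such
as max_ptes_shared may also need raising before the scan actually reaches
the new path.)

        /* Illustrative only -- not part of this patch. */
        #define _GNU_SOURCE
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define SZ_2M   (2UL << 20)

        int main(void)
        {
                /* Reserve 4M so a 2M-aligned start can be picked inside it. */
                char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char *buf = (char *)(((unsigned long)raw + SZ_2M - 1) & ~(SZ_2M - 1));

                madvise(buf, SZ_2M, MADV_HUGEPAGE);
                memset(buf, 1, SZ_2M);          /* fault in a PMD-mapped THP */

                /* Split the PMD (but not the folio) via a partial mprotect. */
                mprotect(buf, 4096, PROT_READ);
                mprotect(buf, 4096, PROT_READ | PROT_WRITE);

                /*
                 * fork() write-protects the PTEs and clears PageAnonExclusive
                 * on the now-shared subpages; with no further writes, the
                 * range should be seen as exclusive == 0 && !writable.
                 */
                if (fork() == 0) {
                        pause();
                        return 0;
                }
                pause();
                return 0;
        }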
>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
>  mm/khugepaged.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 214 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 88433cc25d8a..85c7a2ab44ce 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1237,6 +1237,197 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>          return result;
>  }
>
> +static struct page *find_lock_pte_mapped_page(struct vm_area_struct *vma,
> +                unsigned long addr, pmd_t *pmd)
> +{
> +        pte_t *pte, pteval;
> +        struct page *page = NULL;
> +
> +        pte = pte_offset_map(pmd, addr);
> +        if (!pte)
> +                return NULL;
> +
> +        pteval = ptep_get_lockless(pte);
> +        if (pte_none(pteval) || !pte_present(pteval))
> +                goto out;
> +
> +        page = vm_normal_page(vma, addr, pteval);
> +        if (unlikely(!page) || unlikely(is_zone_device_page(page)))
> +                goto out;
> +
> +        page = compound_head(page);
> +
> +        if (!trylock_page(page)) {
> +                page = NULL;
> +                goto out;
> +        }
> +
> +        if (!get_page_unless_zero(page)) {
> +                unlock_page(page);
> +                page = NULL;
> +                goto out;
> +        }
> +
> +out:
> +        pte_unmap(pte);
> +        return page;
> +}
> +
> +static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
> +                struct vm_area_struct *vma,
> +                unsigned long haddr, bool *mmap_locked,
> +                struct collapse_control *cc)
> +{
> +        struct mmu_notifier_range range;
> +        struct page *hpage;
> +        pte_t *start_pte, *pte;
> +        pmd_t *pmd, pmdval;
> +        spinlock_t *pml, *ptl;
> +        pgtable_t pgtable;
> +        unsigned long addr;
> +        int exclusive = 0;
> +        bool writable = false;
> +        int result, i;
> +
> +        /* Fast check before locking page if already PMD-mapped */
> +        result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> +        if (result == SCAN_PMD_MAPPED)
> +                return result;
> +
> +        hpage = find_lock_pte_mapped_page(vma, haddr, pmd);
> +        if (!hpage)
> +                return SCAN_PAGE_NULL;
> +        if (!PageHead(hpage)) {
> +                result = SCAN_FAIL;
> +                goto drop_hpage;
> +        }
> +        if (compound_order(hpage) != HPAGE_PMD_ORDER) {
> +                result = SCAN_PAGE_COMPOUND;
> +                goto drop_hpage;
> +        }
> +
> +        mmap_read_unlock(mm);
> +        *mmap_locked = false;
> +
> +        /* Prevent all access to pagetables */
> +        mmap_write_lock(mm);
> +
> +        result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
> +        if (result != SCAN_SUCCEED)
> +                goto up_write;
> +
> +        result = check_pmd_still_valid(mm, haddr, pmd);
> +        if (result != SCAN_SUCCEED)
> +                goto up_write;
> +
> +        /* Recheck with mmap write lock */
> +        result = SCAN_SUCCEED;
> +        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> +        if (!start_pte)
> +                goto drop_hpage;

                         ^^^^^^^^^^ should be up_write.

> +        for (i = 0, addr = haddr, pte = start_pte;
> +             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> +                struct page *page;
> +                pte_t pteval = ptep_get(pte);
> +
> +                if (pte_none(pteval) || !pte_present(pteval)) {
> +                        result = SCAN_PTE_NON_PRESENT;
> +                        break;
> +                }
> +
> +                if (pte_uffd_wp(pteval)) {
> +                        result = SCAN_PTE_UFFD_WP;
> +                        break;
> +                }
> +
> +                if (pte_write(pteval))
> +                        writable = true;
> +
> +                page = vm_normal_page(vma, addr, pteval);
> +
> +                if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> +                        result = SCAN_PAGE_NULL;
> +                        break;
> +                }
> +
> +                if (hpage + i != page) {
> +                        result = SCAN_FAIL;
> +                        break;
> +                }
> +
> +                if (PageAnonExclusive(page))
> +                        exclusive++;
> +        }
> +        pte_unmap_unlock(start_pte, ptl);
> +        if (result != SCAN_SUCCEED)
> +                goto drop_hpage;

                         ^^^^^^^^^^ should be up_write.
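
To spell out the two corrections above (and the same one further down):
once mmap_write_lock(mm) has been taken, any failure exit has to leave
through the up_write label so that the write lock is dropped; jumping
straight to drop_hpage only unlocks and puts the page. With the label
layout already in this patch (up_write falls through into drop_hpage),
the fix is just the goto target, e.g. (sketch, not a full hunk):

        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
        if (!start_pte)
                goto up_write;  /* was drop_hpage; mmap write lock is held here */
        ...
up_write:
        mmap_write_unlock(mm);
drop_hpage:                     /* fallthrough: also releases the page */
        unlock_page(hpage);
        put_page(hpage);
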
> +
> +        /*
> +         * Case 1:
> +         * No subpages are PageAnonExclusive (PTEs must be R/O), we can
> +         * collapse into a R/O PMD without further action.
> +         */
> +        if (!(exclusive == 0 && !writable))
> +                goto drop_hpage;

                         ^^^^^^^^^^ should be up_write.

> +
> +        /* Collapse pmd entry */
> +        vma_start_write(vma);
> +        anon_vma_lock_write(vma->anon_vma);
> +
> +        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> +                                haddr, haddr + HPAGE_PMD_SIZE);
> +        mmu_notifier_invalidate_range_start(&range);
> +
> +        pml = pmd_lock(mm, pmd); /* probably unnecessary */
> +        pmdval = pmdp_collapse_flush(vma, haddr, pmd);
> +        spin_unlock(pml);
> +        mmu_notifier_invalidate_range_end(&range);
> +        tlb_remove_table_sync_one();
> +
> +        anon_vma_unlock_write(vma->anon_vma);
> +
> +        start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
> +        if (!start_pte)
> +                goto rollback;
> +        for (i = 0, addr = haddr, pte = start_pte;
> +             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> +                struct page *page;
> +                pte_t pteval = ptep_get(pte);
> +
> +                page = vm_normal_page(vma, addr, pteval);
> +                page_remove_rmap(page, vma, false);
> +        }
> +        pte_unmap_unlock(start_pte, ptl);
> +
> +        /* Install pmd entry */
> +        pgtable = pmd_pgtable(pmdval);
> +        pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);
> +        spin_lock(pml);
> +        page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
> +        pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +        set_pmd_at(mm, haddr, pmd, pmdval);
> +        update_mmu_cache_pmd(vma, haddr, pmd);
> +        spin_unlock(pml);
> +
> +        result = SCAN_SUCCEED;
> +        goto up_write;
> +
> +rollback:
> +        spin_lock(pml);
> +        pmd_populate(mm, pmd, pmd_pgtable(pmdval));
> +        spin_unlock(pml);
> +
> +up_write:
> +        mmap_write_unlock(mm);
> +
> +drop_hpage:
> +        unlock_page(hpage);
> +        put_page(hpage);
> +
> +        /* TODO: tracepoints */
> +        return result;
> +}
> +
>  static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                                     struct vm_area_struct *vma,
>                                     unsigned long address, bool *mmap_locked,
> @@ -1251,6 +1442,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>          spinlock_t *ptl;
>          int node = NUMA_NO_NODE, unmapped = 0;
>          bool writable = false;
> +        int exclusive = 0;
> +        bool is_hpage = false;
>
>          VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1333,8 +1526,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                          }
>                  }
>
> +                if (PageAnonExclusive(page))
> +                        exclusive++;
> +
>                  page = compound_head(page);
>
> +                if (compound_order(page) == HPAGE_PMD_ORDER)
> +                        is_hpage = true;
> +
>                  /*
>                   * Record which node the original page is from and save this
>                   * information to cc->node_load[].
> @@ -1396,7 +1595,22 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>          }
>  out_unmap:
>          pte_unmap_unlock(pte, ptl);
> +
> +        if (is_hpage && (exclusive == 0 && !writable)) {
> +                int res;
> +
> +                res = collapse_pte_mapped_anon_thp(mm, vma, address,
> +                                mmap_locked, cc);
> +                if (res == SCAN_PMD_MAPPED || res == SCAN_SUCCEED) {
> +                        result = res;
> +                        goto out;
> +                }
> +
> +        }
> +
>          if (result == SCAN_SUCCEED) {
> +                if (!*mmap_locked)
> +                        mmap_read_lock(mm);
>                  result = collapse_huge_page(mm, address, referenced,
>                                  unmapped, cc);
>                  /* collapse_huge_page will return with the mmap_lock released */

-- 
Thanks,
Yu