From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 23 Jun 2025 15:21:56 +0800
Subject: Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
To: Dev Jain <dev.jain@arm.com>, akpm@linux-foundation.org, david@redhat.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 npache@redhat.com, ryan.roberts@arm.com, baohua@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250618155608.18580-1-dev.jain@arm.com>
 <0c20196b-f5bd-4238-bbb9-316f6ac3078e@arm.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
In-Reply-To: <0c20196b-f5bd-4238-bbb9-316f6ac3078e@arm.com>

On 2025/6/23 15:16, Dev Jain wrote:
>
> On 23/06/25 12:10 pm, Baolin Wang wrote:
>>
>> On 2025/6/18 23:56, Dev Jain wrote:
>>> Use PTE batching to optimize collapse_pte_mapped_thp().
>>>
>>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for
>>> collapse. Then, calling ptep_clear() for every pte will cause a TLB
>>> flush for every contpte block. Instead, clear_full_ptes() does a
>>> contpte_try_unfold_partial(), which will flush the TLB only for the
>>> starting and ending contpte blocks (if any) that partially overlap
>>> with the range khugepaged is looking at.
>>>
>>> For all arches, there should be a benefit from batching the atomic
>>> operations on mapcounts in folio_remove_rmap_ptes().
>>>
>>> Note that we do not need to change the check
>>> "if (folio_page(folio, i) != page)": if the i'th page of the folio
>>> equals the first page of our batch, then pages i + 1, ...,
>>> i + nr_batch_ptes - 1 of the folio will equal the corresponding
>>> pages of our batch, since the batch maps consecutive pages.
>>>
>>> No issues were observed with mm-selftests.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>
>>> This is rebased on:
>>> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>>> If there will be a v2 of either version I'll send them together.
>>>
>>>  mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>>>  1 file changed, 25 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 649ccb2670f8..7d37058eda5b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>>>  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>                  bool install_pmd)
>>>  {
>>> +    int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>>>      struct mmu_notifier_range range;
>>>      bool notified = false;
>>>      unsigned long haddr = addr & HPAGE_PMD_MASK;
>>> +    unsigned long end = haddr + HPAGE_PMD_SIZE;
>>>      struct vm_area_struct *vma = vma_lookup(mm, haddr);
>>>      struct folio *folio;
>>>      pte_t *start_pte, *pte;
>>>      pmd_t *pmd, pgt_pmd;
>>>      spinlock_t *pml = NULL, *ptl;
>>> -    int nr_ptes = 0, result = SCAN_FAIL;
>>>      int i;
>>>
>>>      mmap_assert_locked(mm);
>>> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>      if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>          goto abort;
>>>
>>> +    i = 0, addr = haddr, pte = start_pte;
>>>      /* step 2: clear page table and adjust rmap */
>>> -    for (i = 0, addr = haddr, pte = start_pte;
>>> -         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>> +    do {
>>> +        const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +        int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>>> +        struct folio *this_folio;
>>>          struct page *page;
>>>          pte_t ptent = ptep_get(pte);
>>>
>>> +        nr_batch_ptes = 1;
>>> +
>>>          if (pte_none(ptent))
>>>              continue;
>>>          /*
>>> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>              goto abort;
>>>          }
>>>          page = vm_normal_page(vma, addr, ptent);
>>> +        this_folio = page_folio(page);
>>> +        if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
>>> +            nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
>>> +                    max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>> +
>>>          if (folio_page(folio, i) != page)
>>>              goto abort;
>>
>> IMO, 'this_folio' is always equal to 'folio', right? Can't we just use
>> 'folio'?
>
> I don't think so. What if we have mremapped some bytes of this PMD range
> to point to another folio?

Then 'folio_page(folio, i) != page' can catch this, which is why I
suggest you move the 'nr_batch_ptes' calculation after the folio_page()
check (a rough sketch follows at the end of this mail).

>> In addition, I think the folio_test_large() and max_nr_batch_ptes
>> checks are redundant, since 'folio' must be a PMD-sized large folio
>> after the 'folio_page(folio, i) != page' check.
>
> As an improvement we can at least do likely(folio_test_large()) since
> this is very likely.
>
>
>> So I think we can move the 'nr_batch_ptes' calculation after the
>> folio_page() check; then it should be:
>>
>> nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
>>             max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>
>>> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>           * TLB flush can be left until pmdp_collapse_flush() does it.
>>>           * PTE dirty? Shmem page is already dirty; file is read-only.
>>>           */
>>> -        ptep_clear(mm, addr, pte);
>>> -        folio_remove_rmap_pte(folio, page, vma);
>>> -        nr_ptes++;
>>> -    }
>>> +        clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
>>> +        folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
>>> +        nr_mapped_ptes += nr_batch_ptes;
>>> +    } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
>>> +         pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>>>
>>>      if (!pml)
>>>          spin_unlock(ptl);
>>>
>>>      /* step 3: set proper refcount and mm_counters. */
>>> -    if (nr_ptes) {
>>> -        folio_ref_sub(folio, nr_ptes);
>>> -        add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>> +    if (nr_mapped_ptes) {
>>> +        folio_ref_sub(folio, nr_mapped_ptes);
>>> +        add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>>      }
>>>
>>>      /* step 4: remove empty page table */
>>> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>              : SCAN_SUCCEED;
>>>      goto drop_folio;
>>>  abort:
>>> -    if (nr_ptes) {
>>> +    if (nr_mapped_ptes) {
>>>          flush_tlb_mm(mm);
>>> -        folio_ref_sub(folio, nr_ptes);
>>> -        add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>> +        folio_ref_sub(folio, nr_mapped_ptes);
>>> +        add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>>      }
>>>  unlock:
>>>      if (start_pte)
>>
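
To make that concrete, here is a rough, untested sketch of how I
imagine step 2 would look with the batch calculation moved after the
folio_page() check, reusing the variable names from your patch; the
checks hidden by the diff context between pte_none() and
vm_normal_page() stay unchanged and are elided here:

	/* step 2: clear page table and adjust rmap */
	i = 0, addr = haddr, pte = start_pte;
	do {
		const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
		int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
		struct page *page;
		pte_t ptent = ptep_get(pte);

		nr_batch_ptes = 1;

		if (pte_none(ptent))
			continue;

		/* ... non-present recheck as in the existing code ... */

		page = vm_normal_page(vma, addr, ptent);
		if (folio_page(folio, i) != page)
			goto abort;

		/*
		 * 'page' is now known to be page i of the PMD-sized
		 * 'folio', so batching is safe without the
		 * folio_test_large() check and the 'this_folio' local.
		 */
		nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
				max_nr_batch_ptes, flags, NULL, NULL, NULL);

		clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
		nr_mapped_ptes += nr_batch_ptes;
	} while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
		 pte += nr_batch_ptes, i < HPAGE_PMD_NR);

This also drops the 'max_nr_batch_ptes != 1' test, since
folio_pte_batch() should just return 1 when given a max of 1.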