From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Mon, 23 Jun 2025 14:40:01 +0800
Subject: Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
To: Dev Jain <dev.jain@arm.com>, akpm@linux-foundation.org, david@redhat.com
Cc: ziy@nvidia.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 npache@redhat.com, ryan.roberts@arm.com, baohua@kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250618155608.18580-1-dev.jain@arm.com>
In-Reply-To: <20250618155608.18580-1-dev.jain@arm.com>

On 2025/6/18 23:56, Dev Jain wrote:
> Use PTE batching to optimize collapse_pte_mapped_thp().
> 
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
> 
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
> 
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if the i'th page of the folio is equal
> to the first page of our batch, then the i + 1, ..., i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
> 
> No issues were observed with mm-selftests.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> 
> This is rebased on:
> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> If there will be a v2 of either version I'll send them together.
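
To make the before/after concrete for readers following along, the change
boils down to replacing the per-PTE clear/unmap pattern with one batched
call per run of consecutive PTEs. A rough sketch of the idea, not the
actual patch code ('nr' and 'max_nr' stand in for the patch's
'nr_batch_ptes' and 'max_nr_batch_ptes'):

	/*
	 * Before: each iteration clears one PTE and drops one rmap
	 * reference; on arm64 each ptep_clear() may unfold and flush a
	 * whole contpte block.
	 */
	for (i = 0; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
		ptep_clear(mm, addr, pte);
		folio_remove_rmap_pte(folio, page, vma);
	}

	/*
	 * After: find how many consecutive PTEs map consecutive pages
	 * of the same folio, then clear and unmap them in one go.
	 */
	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, flags,
			     NULL, NULL, NULL);
	clear_full_ptes(mm, addr, pte, nr, /* full = */ false);
	folio_remove_rmap_ptes(folio, page, nr, vma);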
> 
>  mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>  1 file changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 649ccb2670f8..7d37058eda5b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			    bool install_pmd)
>  {
> +	int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>  	struct mmu_notifier_range range;
>  	bool notified = false;
>  	unsigned long haddr = addr & HPAGE_PMD_MASK;
> +	unsigned long end = haddr + HPAGE_PMD_SIZE;
>  	struct vm_area_struct *vma = vma_lookup(mm, haddr);
>  	struct folio *folio;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
>  	spinlock_t *pml = NULL, *ptl;
> -	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
> 
>  	mmap_assert_locked(mm);
> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>  		goto abort;
> 
> +	i = 0, addr = haddr, pte = start_pte;
>  	/* step 2: clear page table and adjust rmap */
> -	for (i = 0, addr = haddr, pte = start_pte;
> -	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> +	do {
> +		const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +		int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
> +		struct folio *this_folio;
>  		struct page *page;
>  		pte_t ptent = ptep_get(pte);
> 
> +		nr_batch_ptes = 1;
> +
>  		if (pte_none(ptent))
>  			continue;
>  		/*
> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  			goto abort;
>  		}
>  		page = vm_normal_page(vma, addr, ptent);
> +		this_folio = page_folio(page);
> +		if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
> +			nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
> +							max_nr_batch_ptes, flags, NULL, NULL, NULL);
> +
>  		if (folio_page(folio, i) != page)
>  			goto abort;

IMO, 'this_folio' is always equal to 'folio', right? Can't we just use
'folio'?

In addition, I think the folio_test_large() and max_nr_batch_ptes checks
are redundant, since 'folio' must be a PMD-sized large folio after the
'folio_page(folio, i) != page' check. So I think we can move the
'nr_batch_ptes' calculation after the folio_page() check; it should then
be (see the untested sketch at the end of this mail):

nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
				max_nr_batch_ptes, flags, NULL, NULL, NULL);

> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		 * TLB flush can be left until pmdp_collapse_flush() does it.
>  		 * PTE dirty? Shmem page is already dirty; file is read-only.
>  		 */
> -		ptep_clear(mm, addr, pte);
> -		folio_remove_rmap_pte(folio, page, vma);
> -		nr_ptes++;
> -	}
> +		clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
> +		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
> +		nr_mapped_ptes += nr_batch_ptes;
> +	} while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
> +		 pte += nr_batch_ptes, i < HPAGE_PMD_NR);
> 
>  	if (!pml)
>  		spin_unlock(ptl);
> 
>  	/* step 3: set proper refcount and mm_counters. */
> -	if (nr_ptes) {
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +	if (nr_mapped_ptes) {
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
> 
>  	/* step 4: remove empty page table */
> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		       : SCAN_SUCCEED;
>  	goto drop_folio;
> abort:
> -	if (nr_ptes) {
> +	if (nr_mapped_ptes) {
>  		flush_tlb_mm(mm);
> -		folio_ref_sub(folio, nr_ptes);
> -		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> +		folio_ref_sub(folio, nr_mapped_ptes);
> +		add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>  	}
> unlock:
>  	if (start_pte)
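
To spell out the restructuring suggested above (an untested sketch, only
to illustrate the placement; all names are from the patch):

	page = vm_normal_page(vma, addr, ptent);
	if (folio_page(folio, i) != page)
		goto abort;

	/*
	 * 'folio' is the PMD-sized large folio being collapsed here, so
	 * the folio_test_large() and max_nr_batch_ptes != 1 checks can
	 * be dropped and 'this_folio' is not needed.
	 */
	nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
					max_nr_batch_ptes, flags,
					NULL, NULL, NULL);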