From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Tue, 16 Dec 2025 11:47:00 +0800
Subject: Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
To: Lorenzo Stoakes
Cc: akpm@linux-foundation.org, david@kernel.org, catalin.marinas@arm.com,
	will@kernel.org, ryan.roberts@arm.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	riel@surriel.com, harry.yoo@oracle.com, jannh@google.com,
	willy@infradead.org, baohua@kernel.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Message-ID: <7391d74f-782c-4d5a-9f9d-02afcefdbf87@linux.alibaba.com>
In-Reply-To: <4413d7f4-7734-4715-a450-2ba09db5ed22@lucifer.local>
References: <545dba5e899634bc6c8ca782417d16fef3bd049f.1765439381.git.baolin.wang@linux.alibaba.com>
	<4413d7f4-7734-4715-a450-2ba09db5ed22@lucifer.local>

On 2025/12/15 20:22, Lorenzo Stoakes wrote:
> On Thu, Dec 11, 2025 at 04:16:55PM +0800, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>
> arm64 you mean :)

Right. Will make it clear.

>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real	0m1.518s
>> user	0m0.000s
>> sys	0m1.518s
>>
>> W/ patchset:
>> real	0m1.018s
>> user	0m0.000s
>> sys	0m1.018s
>
> That's nice!
>
> Have you performed the same kind of performance testing on non-arm64? As in the
> past we've had a batch optimisation go horribly wrong on non-arm64 even if it
> was ok on arm64 :)

Yes, it seems you missed my test results for the x86 machine in the commit
message :)

"I can observe 33% performance improvement on my Arm64 32-core server (and
10%+ improvement on my X86 machine)."
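For anyone who wants to reproduce this, the test is essentially the
following minimal sketch (the file path and cgroup name are illustrative,
not from the actual test; the program is assumed to already run inside the
test cgroup, with cgroup v2 mounted at /sys/fs/cgroup and a 10G file
pre-created on a filesystem that supports large folios):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 10UL << 30;	/* size of the backing file: 10G */
	volatile char sum = 0;
	size_t i;

	/* Populate the page cache with clean file-backed folios. */
	int fd = open("/mnt/test/file-10G", O_RDONLY);
	char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (fd < 0 || p == MAP_FAILED) {
		perror("setup");
		return 1;
	}

	for (i = 0; i < len; i += 4096)
		sum += p[i];

	/*
	 * Proactively reclaim 8G from this cgroup; this write is
	 * presumably the step the "time" numbers above measure.
	 */
	int rfd = open("/sys/fs/cgroup/test/memory.reclaim", O_WRONLY);

	if (rfd < 0 || write(rfd, "8G", 2) < 0)
		perror("memory.reclaim");

	return 0;
}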
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>>  include/linux/mmu_notifier.h     |  9 +++++----
>>  include/linux/pgtable.h          | 19 +++++++++++++++++++
>>  mm/rmap.c                        | 22 ++++++++++++++++++++--
>>  4 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e03034683156..a865bd8c46a3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  	return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>  }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +					 unsigned long addr, pte_t *ptep,
>> +					 unsigned int nr)
>> +{
>> +	if (likely(nr == 1))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>
> Hmm again this is a weird way of exposing a contpte-specific function, you
> really need to rework that as discussed in patch 1/3.
>
> It seems to me we can share code to avoid this.

Sorry, I don't think so. This is the current way of exposing a contpte
helper for arm64. Please take a look at set_ptes(), clear_full_ptes(),
wrprotect_ptes() and so on (in this file).
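For example, clear_full_ptes() in the same header already follows exactly
this shape (a simplified sketch from memory; see the tree for the exact
current form):

#define clear_full_ptes clear_full_ptes
static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, unsigned int nr, int full)
{
	if (likely(nr == 1)) {
		/* A single entry may still sit inside a contpte block. */
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
		__clear_full_ptes(mm, addr, ptep, nr, full);
	} else {
		contpte_clear_full_ptes(mm, addr, ptep, nr, full);
	}
}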
>>  #define wrprotect_ptes wrprotect_ptes
>>  static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>>  		unsigned long addr, pte_t *ptep, unsigned int nr)
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..be594b274729 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>  	range->owner = owner;
>>  }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr)	\
>>  ({									\
>>  	int __young;							\
>>  	struct vm_area_struct *___vma = __vma;				\
>>  	unsigned long ___address = __address;				\
>> -	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
>> +	unsigned int ___nr = __nr;					\
>> +	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>>  	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
>>  						  ___address,		\
>>  						  ___address +		\
>> -						  PAGE_SIZE);		\
>> +						  ___nr * PAGE_SIZE);	\
>>  	__young;							\
>>  })
>
> An aside, but I wonder why this needs to be a (pretty disgusting) macro?

Um, I can send a follow-up to clean up all these related macros.
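One possible shape for such a cleanup (a hypothetical sketch, not part of
this series):

static inline int ptep_clear_flush_young_notify(struct vm_area_struct *vma,
						unsigned long address,
						pte_t *ptep, unsigned int nr)
{
	int young = clear_flush_young_ptes(vma, address, ptep, nr);

	/* Report the same range to MMU notifiers as the macro does. */
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + nr * PAGE_SIZE);
	return young;
}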
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>>  #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
>>  #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>>  #define ptep_clear_young_notify ptep_test_and_clear_young
>>  #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b13b6f42be3c..c7d0fd228cb7 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>  }
>>  #endif
>>
>> +#ifndef clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +					 unsigned long addr, pte_t *ptep,
>> +					 unsigned int nr)
>> +{
>> +	int young = 0;
>> +
>> +	for (;;) {
>> +		young |= ptep_clear_flush_young(vma, addr, ptep);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +
>> +	return young;
>> +}
>> +#endif
>> +
>>  /*
>>   * On some architectures hardware does not set page access bit when accessing
>>   * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index d6799afe1114..ec232165c47d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
>>  	struct folio_referenced_arg *pra = arg;
>>  	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>>  	int ptes = 0, referenced = 0;
>> +	unsigned int nr;
>>
>>  	while (page_vma_mapped_walk(&pvmw)) {
>>  		address = pvmw.address;
>> +		nr = 1;
>>
>>  		if (vma->vm_flags & VM_LOCKED) {
>>  			ptes++;
>> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
>>  			if (lru_gen_look_around(&pvmw))
>>  				referenced++;
>>  		} else if (pvmw.pte) {
>> +			if (folio_test_large(folio)) {
>> +				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
>> +				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
>> +				pte_t pteval = ptep_get(pvmw.pte);
>> +
>> +				nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>
> I do wish we could put this fiddly logic into a helper for each place in
> which we do similar kind of 'end of the PTE table, maximum number we could
> have' logic.

Um, the logic is already simple, and I don't think adding a new helper
would improve readability. If some code can reuse this logic, we can
factor it out into a helper at that point.

>> +			}
>
> NIT but we're running into pretty long lines here.

OK. Will fix this.

>> +
>> +			ptes += nr;
>>  			if (ptep_clear_flush_young_notify(vma, address,
>> -							  pvmw.pte))
>> +							  pvmw.pte, nr))
>>  				referenced++;
>
> I find this referenced logic weird; it seems like it should be a boolean,
> but this is outside the scope of your patch here :)

Right. Thanks for reviewing.
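(For reference, the helper alluded to above could be as small as the
following hypothetical sketch, simply factoring out the computation from
the rmap.c hunk; the name is illustrative:)

/* Max PTEs from addr to the end of the PTE table, clamped by the VMA end. */
static inline unsigned int pte_batch_max_nr(unsigned long addr,
					    unsigned long vma_end)
{
	return (pmd_addr_end(addr, vma_end) - addr) >> PAGE_SHIFT;
}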