Subject: Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
Date: Tue, 20 May 2025 14:48:05 +0530
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, ryan.roberts@arm.com, david@redhat.com,
 willy@infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 catalin.marinas@arm.com, will@kernel.org, Liam.Howlett@oracle.com,
 lorenzo.stoakes@oracle.com, vbabka@suse.cz, jannh@google.com,
 anshuman.khandual@arm.com, peterx@redhat.com, joey.gouly@arm.com,
 ioworker0@gmail.com, kevin.brodsky@arm.com,
 quic_zhenhuah@quicinc.com, christophe.leroy@csgroup.eu,
 yangyicong@hisilicon.com, linux-arm-kernel@lists.infradead.org,
 hughd@google.com, yang@os.amperecomputing.com, ziy@nvidia.com
From: Dev Jain <dev.jain@arm.com>
References: <20250519074824.42909-1-dev.jain@arm.com>
 <20250519074824.42909-4-dev.jain@arm.com>

Hi, sorry for the late reply; my Thunderbird is not working and I am not able
to receive emails. I am manually downloading the mbox and then replying, so my
replies may be slow :(

On 19/05/25 1:48 pm, Barry Song wrote:
> On Mon, May 19, 2025 at 7:49 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Use folio_pte_batch() to batch-process a large folio. Reuse the folio from
>> the prot_numa case if possible. Since modify_prot_start_ptes() gathers the
>> access/dirty bits, it lets us batch around pte_needs_flush() (for parisc,
>> the definition includes the access bit).
>>
>> For all cases other than the PageAnonExclusive case, if the condition holds
>> for one pte in the batch, it is guaranteed to hold for the other ptes in the
>> batch too; for pte_needs_soft_dirty_wp() this is ensured by not passing
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and
>> access bits across the batch, so we can also batch across pte_dirty(): this
>> is correct because the dirty bit on the PTE is really just an indication
>> that the folio got written to, so even if this particular PTE is not dirty
>> (but another PTE in the batch is), the wp-fault optimization can still be
>> made.
>>
>> The crux now is how to batch around the PageAnonExclusive case; we must
>> check the corresponding condition for every single page. Therefore, from
>> the large folio batch, we determine a sub-batch of ptes mapping pages that
>> share the same PageAnonExclusive state, process that sub-batch, then
>> determine and process the next sub-batch, and so on.
>> Note that this does not cause any extra overhead: suppose the size of the
>> folio batch is 512; then the sub-batch processing takes 512 iterations in
>> total, which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  include/linux/mm.h |   7 ++-
>>  mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>  2 files changed, 104 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..7d5b96f005dc 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>                                              MM_CP_UFFD_WP_RESOLVE)
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -                             pte_t pte);
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                              pte_t pte, int max_len, int *len);
>> +#define can_change_pte_writable(vma, addr, pte) \
>> +        can_change_ptes_writable(vma, addr, pte, 1, NULL)
>> +
>>  extern long change_protection(struct mmu_gather *tlb,
>>                                struct vm_area_struct *vma, unsigned long start,
>>                                unsigned long end, unsigned long cp_flags);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 124612ce3d24..6cd8cdc168fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,25 +40,36 @@
>>
>>  #include "internal.h"
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -                             pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                              pte_t pte, int max_len, int *len)
>>  {
>>          struct page *page;
>> +        bool temp_ret;
>> +        bool ret;
>> +        int i;
>>
>> -        if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -                return false;
>> +        if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>> +                ret = false;
>> +                goto out;
>> +        }
>>
>>          /* Don't touch entries that are not even readable. */
>> -        if (pte_protnone(pte))
>> -                return false;
>> +        if (pte_protnone(pte)) {
>> +                ret = false;
>> +                goto out;
>> +        }
>>
>>          /* Do we need write faults for softdirty tracking? */
>> -        if (pte_needs_soft_dirty_wp(vma, pte))
>> -                return false;
>> +        if (pte_needs_soft_dirty_wp(vma, pte)) {
>> +                ret = false;
>> +                goto out;
>> +        }
>>
>>          /* Do we need write faults for uffd-wp tracking? */
>> -        if (userfaultfd_pte_wp(vma, pte))
>> -                return false;
>> +        if (userfaultfd_pte_wp(vma, pte)) {
>> +                ret = false;
>> +                goto out;
>> +        }
>>
>>          if (!(vma->vm_flags & VM_SHARED)) {
>>                  /*
>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>                   * any additional checks while holding the PT lock.
>>                   */
>>                  page = vm_normal_page(vma, addr, pte);
>> -                return page && PageAnon(page) && PageAnonExclusive(page);
>> +                ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +                if (!len)
>> +                        return ret;
>> +
>> +                /* Check how many consecutive pages are AnonExclusive or not */
>> +                for (i = 1; i < max_len; ++i) {
>> +                        ++page;
>> +                        temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +                        if (temp_ret != ret)
>
> Do we really need to do PageAnon for each subpage, which is
> folio_test_anon(page_folio(page)),
>
> since we have checked subpage[0]?

You are correct. I just wrote it like that for consistency; I'll change it in v4.
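Just to make that concrete, the adjusted loop could look roughly like the
sketch below. This is only an illustrative sketch, not the actual v4 change;
it reuses the page/ret/temp_ret/i/max_len/len locals from the hunk quoted
above, and relies on all subpages belonging to the same folio:

        /*
         * Sketch only: PageAnon() was already checked for subpage[0], and
         * all subpages here come from the same folio, so only the per-page
         * PageAnonExclusive() state needs rechecking to size the sub-batch.
         */
        for (i = 1; i < max_len; ++i) {
                ++page;
                temp_ret = PageAnonExclusive(page);
                if (temp_ret != ret)
                        break;
        }
        *len = i;
        return ret;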
>
>> +                                break;
>> +                }
>> +                *len = i;
>> +                return ret;
>>          }
>>
>>          VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>           * FS was already notified and we can simply mark the PTE writable
>>           * just like the write-fault handler would do.
>>           */
>> -        return pte_dirty(pte);
>> +        ret = pte_dirty(pte);
>> +
>> +out:
>> +        /* The entire batch is guaranteed to have the same return value */
>> +        if (len)
>> +                *len = max_len;
>> +        return ret;
>>  }
>>
>>  static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
>> -                pte_t pte, int max_nr_ptes)
>> +                pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>>  {
>> -        const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +        fpb_t flags = FPB_IGNORE_DIRTY;
>>
>> -        if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +        if (ignore_soft_dirty)
>> +                flags |= FPB_IGNORE_SOFT_DIRTY;
>> +
>> +        if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>                  return 1;
>>
>>          return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>                                 NULL, NULL, NULL);
>>  }
>>
>> +/**
>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>> + * the start of the original folio batch.
>> + */
>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>> +                unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>> +                int max_len, int sub_batch_idx)
>> +{
>> +        unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>> +        pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>> +        pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>> +        pte_t *new_ptep = ptep + sub_batch_idx;
>> +        int len = 1;
>> +
>> +        if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>> +                new_ptent = pte_mkwrite(new_ptent, vma);
>> +
>> +        modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
>> +        if (pte_needs_flush(new_oldpte, new_ptent))
>> +                tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>> +        return len;
>> +}
>> +
>>  static long change_pte_range(struct mmu_gather *tlb,
>>                  struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>                  unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>          bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>          bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>          bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> -        int nr_ptes;
>> +        int sub_batch_idx, max_len, len, nr_ptes;
>>
>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>          pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>          flush_tlb_batched_pending(vma->vm_mm);
>>          arch_enter_lazy_mmu_mode();
>>          do {
>> +                sub_batch_idx = 0;
>>                  nr_ptes = 1;
>>                  oldpte = ptep_get(pte);
>>                  if (pte_present(oldpte)) {
>>                          int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +                        struct folio *folio = NULL;
>>                          pte_t ptent;
>>
>>                          /*
>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                           * pages. See similar comment in change_huge_pmd.
>>                           */
>>                          if (prot_numa) {
>> -                                struct folio *folio;
>>                                  int nid;
>>                                  bool toptier;
>>
>> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                                      toptier) {
>>  skip_batch:
>>                                          nr_ptes = mprotect_batch(folio, addr, pte,
>> -                                                                 oldpte, max_nr_ptes);
>> +                                                                 oldpte, max_nr_ptes,
>> +                                                                 true);
>>                                          continue;
>>                                  }
>>                                  if (folio_use_access_time(folio))
>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                                                          jiffies_to_msecs(jiffies));
>>                          }
>>
>> +                        if (!folio)
>> +                                folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +                        nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>> +                                                 max_nr_ptes, false);
>>                          oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>                          ptent = pte_modify(oldpte, newprot);
>>
>> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                           * example, if a PTE is already dirty and no other
>>                           * COW or special handling is required.
>>                           */
>> -                        if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -                            !pte_write(ptent) &&
>> -                            can_change_pte_writable(vma, addr, ptent))
>> -                                ptent = pte_mkwrite(ptent, vma);
>> -
>> -                        modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -                        if (pte_needs_flush(oldpte, ptent))
>> -                                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -                        pages++;
>> +                        if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>> +                                max_len = nr_ptes;
>> +                                while (sub_batch_idx < nr_ptes) {
>> +
>> +                                        /* Get length of sub batch */
>> +                                        len = modify_sub_batch(vma, tlb, addr, pte,
>> +                                                               oldpte, ptent, max_len,
>> +                                                               sub_batch_idx);
>> +                                        sub_batch_idx += len;
>> +                                        max_len -= len;
>> +                                }
>> +                        } else {
>> +                                modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> +                                if (pte_needs_flush(oldpte, ptent))
>> +                                        tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +                        }
>> +                        pages += nr_ptes;
>>                  } else if (is_swap_pte(oldpte)) {
>>                          swp_entry_t entry = pte_to_swp_entry(oldpte);
>>                          pte_t newpte;
>> --
>> 2.30.2
>>
>
> Thanks
> barry