From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching
From: Dev Jain <dev.jain@arm.com>
Date: Thu, 8 May 2025 13:35:33 +0530
To: Anshuman Khandual, akpm@linux-foundation.org
Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, vbabka@suse.cz,
 jannh@google.com, pfalcato@suse.de, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, david@redhat.com, peterx@redhat.com,
 ryan.roberts@arm.com, mingo@kernel.org, libang.li@antgroup.com,
 maobibo@loongson.cn, zhengqi.arch@bytedance.com, baohua@kernel.org,
 willy@infradead.org, ioworker0@gmail.com,
 yang@os.amperecomputing.com, baolin.wang@linux.alibaba.com,
 ziy@nvidia.com, hughd@google.com
References: <20250507060256.78278-1-dev.jain@arm.com>
 <20250507060256.78278-3-dev.jain@arm.com>

On 08/05/25 12:46 pm, Anshuman Khandual wrote:
> On 5/7/25 11:32, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To
>> generalize this hint, add a helper which will determine whether two
>> consecutive PTEs point to consecutive PFNs, in which case there is a
>> high probability that the underlying folio is large.
>>
>> Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the
>> ptes are painted with the contig bit, then ptep_get() will iterate
>> through all 16 entries to collect a/d bits. Hence this optimization
>> will result in a 16x reduction in the number of ptep_get() calls.
>> Next, ptep_get_and_clear() will eventually call contpte_try_unfold()
>> on every contig block, thus flushing the TLB for the complete large
>> folio range. Instead, use get_and_clear_full_ptes() so as to elide
>> TLBIs on each contig block, and only do them on the starting and
>> ending contig block.
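
As an aside for reviewers not familiar with the arm64 side: the generic
pte_batch_hint() fallback just reports a batch of one entry, which is
why the new helper falls back to a single extra ptep_get() comparison
on other architectures. A rough sketch of the generic fallback follows
(the arm64 contpte implementation instead reports the number of entries
remaining in the contig block):

#ifndef pte_batch_hint
/* Generic fallback: no architecture hint, report a batch of one. */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif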
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
>>  mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
>>  2 files changed, 59 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..38dab1f562ed 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>  }
>>  #endif
>>
>> +/**
>> + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte
>> + * belongs to a large folio.
>> + * @ptep: Pointer to the page table entry.
>> + * @pte: The page table entry.
>> + *
>> + * This helper is invoked when the caller wants to batch over a set of
>> + * ptes mapping a large folio, but the concerned code path does not
>> + * already have the folio. We want to avoid the cost of
>> + * vm_normal_folio() only to find that the underlying folio was small;
>> + * i.e. keep the small folio case as fast as possible.
>> + *
>> + * The caller must ensure that ptep + 1 exists.
>> + */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +	pte_t *next_ptep, next_pte;
>> +
>> +	if (pte_batch_hint(ptep, pte) != 1)
>> +		return true;
>> +
>> +	next_ptep = ptep + 1;
>> +	next_pte = ptep_get(next_ptep);
>> +	if (!pte_present(next_pte))
>> +		return false;
>> +
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
>> +}
>> +
>>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  					    unsigned long address,
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..9c88a276bec4 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>  	return pte;
>>  }
>>
>> +/* mremap a batch of PTEs mapping the same large folio */
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +				  pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +	int nr = 1;
>
> A small nit - s/nr/nr_pages ?

Well, nr is what all the other places use, so I would like to keep it
simple and stick to convention :)

>
>> +
>> +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
>
> As mentioned earlier in v1, could maybe_contiguous_pte_pfns() here add
> some additional cost for buffers that are not actually mapped to
> contiguous physical pages?
>
> The test case you mentioned in the cover letter demonstrating the
> performance gains might have always been run just after boot, thus
> increasing the probability of contiguous physical mappings, which will
> not be the case on fragmented memory systems. In that case the proposed
> consecutive PFN comparison will always happen unconditionally, without
> any benefit?

I think you mean to say that the underlying folio may not actually be
large, but the buddy allocator happened to hand out consecutive physical
memory. Hmm... at this rate I am thinking that the overhead of
vm_normal_folio() + folio_test_large() is acceptable and is less churn :)
I would like to hear your thoughts.
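
Concretely, the hint-less variant would look something like this
(untested sketch; it reuses the same calls as the patch and just drops
maybe_contiguous_pte_pfns(), paying the vm_normal_folio() lookup
unconditionally):

/* Untested sketch: always do the folio lookup, no PFN hint. */
static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
				  pte_t *ptep, pte_t pte, int max_nr)
{
	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
	struct folio *folio;

	if (max_nr == 1)
		return 1;

	folio = vm_normal_folio(vma, addr, pte);
	if (!folio || !folio_test_large(folio))
		return 1;

	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags,
			       NULL, NULL, NULL);
}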
>
> Just curious.
>
> From V1
>
> --------------------------------------------------------------------
> maybe_contiguous_pte_pfns() cost will be applicable for memory
> areas greater than a single PAGE_SIZE (i.e. max_nr != 1)? This
> helper extracts an additional consecutive pte, ensures that it
> is validly mapped, and extracts the pfn before comparing for the span.
>
> There is some cost associated with the above code sequence, which
> looks justified for sequential access of memory buffers that have
> consecutive physical memory backing. But what happens when such
> buffers are less probable - will those buffers take a performance
> hit for all the comparisons that just turn out to be negative?
> --------------------------------------------------------------------
>
>> +		folio = vm_normal_folio(vma, addr, pte);
>> +		if (folio && folio_test_large(folio))
>> +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
>> +					     flags, NULL, NULL, NULL);
>> +	}
>> +	return nr;
>> +}
>> +
>>  static int move_ptes(struct pagetable_move_control *pmc,
>>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>  {
>> @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>  	struct mm_struct *mm = vma->vm_mm;
>>  	pte_t *old_ptep, *new_ptep;
>> -	pte_t pte;
>> +	pte_t old_pte, pte;
>>  	pmd_t dummy_pmdval;
>>  	spinlock_t *old_ptl, *new_ptl;
>>  	bool force_flush = false;
>> @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>  	unsigned long old_end = old_addr + extent;
>>  	unsigned long len = old_end - old_addr;
>>  	int err = 0;
>> +	int max_nr;
>
> A small nit - s/max_nr/max_nr_pages ?
>
>>
>>  	/*
>>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
>> @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>  	flush_tlb_batched_pending(vma->vm_mm);
>>  	arch_enter_lazy_mmu_mode();
>>
>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -	     new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
>> +	     new_ptep += nr, new_addr += nr * PAGE_SIZE) {
>
>
>> +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
>>  			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>  		/*
>>  		 * If we are remapping a valid PTE, make sure
>>  		 * to flush TLB before we drop the PTL for the
>> @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>  		 * the TLB entry for the old mapping has been
>>  		 * flushed.
>>  		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +						    old_pte, max_nr);
>>  			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
>>  		pte = move_pte(pte, old_addr, new_addr);
>>  		pte = move_soft_dirty_pte(pte);
>>
>> @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>  			else if (is_swap_pte(pte))
>>  				pte = pte_swp_clear_uffd_wp(pte);
>>  		}
>> -		set_pte_at(mm, new_addr, new_ptep, pte);
>> +		set_ptes(mm, new_addr, new_ptep, pte, nr);
>>  	}
>>  }
>>
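
For reference, a minimal userspace sketch (hypothetical; not the
benchmark from the cover letter) of the pattern this series targets:
fault in an anonymous region that THP may back with large folios, then
force mremap() to move it through the batched move_ptes() path:

#define _GNU_SOURCE		/* for the 5-argument mremap() */
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* 2M: 32 x 64K contpte blocks with 4K pages */
	void *src, *dst;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return 1;
	memset(src, 1, len);	/* fault in; THP may install large folios */

	/* MREMAP_FIXED forces a move, so every PTE must be remapped. */
	dst = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
		     (char *)src + 2 * len);
	return dst == MAP_FAILED;
}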