From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
From: Muchun Song <muchun.song@linux.dev>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka, Andrew Morton, Barry Song, David Hildenbrand,
    "Liam R. Howlett", Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
    Stefan Kristiansson, Stafford Horne, Paul Walmsley, Palmer Dabbelt,
    Albert Ou, Alexandre Ghiti, Jann Horn, loongarch@lists.linux.dev,
    linux-kernel@vger.kernel.org, linux-openrisc@vger.kernel.org,
    linux-riscv@lists.infradead.org, linux-mm@kvack.org
Date: Thu, 5 Jun 2025 17:42:16 +0800
Message-Id: <1AA4A4B3-AEBE-484A-8EE2-35A15035E748@linux.dev>
In-Reply-To: <5a06f383-7d01-4ea7-a010-013063770240@lucifer.local>
References: <20250604141958.111300-1-lorenzo.stoakes@oracle.com>
 <536f5ffd-b4cf-4f35-970c-ad3a7593af2f@suse.cz>
 <5a06f383-7d01-4ea7-a010-013063770240@lucifer.local>

> On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
>> On 6/4/25 16:19, Lorenzo Stoakes wrote:
>>> The walk_page_range_novma() function is rather confusing - it supports two
>>> modes, one used often, the other used only for debugging.
>>>
>>> The first mode is the common case of traversal of kernel page tables, which
>>> is what nearly all callers use this for.
>>>
>>> Secondly it provides an unusual debugging interface that allows for the
>>> traversal of page tables in a userland range of memory even for that memory
>>> which is not described by a VMA.
>>>
>>> It is far from certain that such page tables should even exist, but perhaps
>>> this is precisely why it is useful as a debugging mechanism.
>>>
>>> As a result, this is utilised by ptdump only. Historically, things were
>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>> to use the kernel page table walking here.
>>>
>>> Since we have some complicated and confusing locking rules for the novma
>>> case, it makes sense to separate the two usages into their own functions.
>>>
>>> Doing this also provides self-documentation as to the intent of the caller -
>>> are they doing something rather unusual or are they simply doing a standard
>>> kernel page table walk?
>>>
>>> We therefore establish two separate functions - walk_page_range_debug() for
>>> this single usage, and walk_kernel_page_table_range() for general kernel
>>> page table walking.
>>>
>>> We additionally make walk_page_range_debug() internal to mm.
>>>
>>> Note that ptdump uses the precise same function for kernel walking as a
>>
>> IMHO it's not clear at this point what "the precise same function" means.
>>
>>> convenience, so we permit this but make it very explicit by having
>>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>>
>> ^ walk_page_range_debug()
>
> Oops will fix.
>
>>
>> Maybe this could be reworded in the sense (AFAIU) that
>> walk_page_range_debug() can be used for both user space page table walking
>> or kernel depending on what mm is passed, so in the case of init_mm it
>> invokes walk_kernel_page_table_range() internally.
>
> Sure.
>
>>
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>> v2:
>>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
>>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
>>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>>>   per David.
>>>
>>> v1 resend:
>>> * Actually cc'd lists...
>>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>>>   updated the commit message to reference this.
>>> * Added Mike's off-list Acked-by.
>>> * Fixed up comments as per Mike.
>>> * Added some historic flavour to the commit message as per Mike.
>>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>>>
>>> v1:
>>> (accidentally sent off-list due to error in scripting)
>>>
>>>  arch/loongarch/mm/pageattr.c |  2 +-
>>>  arch/openrisc/kernel/dma.c   |  4 +-
>>>  arch/riscv/mm/pageattr.c     |  8 +--
>>>  include/linux/pagewalk.h     |  7 ++-
>>>  mm/hugetlb_vmemmap.c         |  2 +-
>>>  mm/internal.h                |  4 ++
>>>  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>>>  mm/ptdump.c                  |  3 +-
>>>  8 files changed, 82 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
>>> index 99165903908a..f5e910b68229 100644
>>> --- a/arch/loongarch/mm/pageattr.c
>>> +++ b/arch/loongarch/mm/pageattr.c
>>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>>>  		return 0;
>>>
>>>  	mmap_write_lock(&init_mm);
>>> -	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
>>> +	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>>>  	mmap_write_unlock(&init_mm);
>>
>> You've removed init_mm from walk_page_range_novma() but I see most callers
>> do the locking of init_mm immediately around it. This suggests a version
>> handling that automatically? A bit complicated by the read/write
>> possibilities, so maybe not worth wrapping?
>> Just a thought, as David says ;)
>
> Most callers write lock interestingly, but then one read locks, so we can't
> just assume and would need to pass a boolean which would kind of suck.

Hi Lorenzo,

Actually, the write lock introduced in commit 8782fb61cc848 to fix the race
condition when walking user page tables can be replaced with a read lock. As
explained in commit b123d09304d86, it is safe to walk kernel page tables while
holding the mmap read lock. The function name walk_kernel_page_table_range()
clearly indicates its purpose: walking kernel page tables. Thus, taking the
read lock internally is appropriate and safe. Please correct me if I am wrong.

To further enhance robustness, it would be better to add a WARN_ON check
ensuring that the address range passed to walk_kernel_page_table_range() is
indeed within the kernel address space. This will help prevent accidental
misuse and catch issues early.

Thanks,
Muchun

>
> Also other walkers assume the caller has the lock so it's consistent to
> keep it this way.
>
>>
>>>
>>>  	flush_tlb_kernel_range(start, end);
>>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
>>> index 3a7b5baaa450..af932a4ad306 100644
>>> --- a/arch/openrisc/kernel/dma.c
>>> +++ b/arch/openrisc/kernel/dma.c
>>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>>>  	 * them and setting the cache-inhibit bit.
>>>  	 */
>>>  	mmap_write_lock(&init_mm);
>>> -	error = walk_page_range_novma(&init_mm, va, va + size,
>>> +	error = walk_kernel_page_table_range(va, va + size,
>>>  			&set_nocache_walk_ops, NULL, NULL);
>>>  	mmap_write_unlock(&init_mm);
>>>
>>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>>>
>>>  	mmap_write_lock(&init_mm);
>>>  	/* walk_page_range shouldn't be able to fail here */
>>> -	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
>>> +	WARN_ON(walk_kernel_page_table_range(va, va + size,
>>>  			&clear_nocache_walk_ops, NULL, NULL));
>>>  	mmap_write_unlock(&init_mm);
>>>  }
>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
>>> index d815448758a1..3f76db3d2769 100644
>>> --- a/arch/riscv/mm/pageattr.c
>>> +++ b/arch/riscv/mm/pageattr.c
>>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>>>  		if (ret)
>>>  			goto unlock;
>>>
>>> -	ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
>>> +	ret = walk_kernel_page_table_range(lm_start, lm_end,
>>>  					&pageattr_ops, NULL, &masks);
>>
>> Note this and other places break the second line's arguments alignment on
>> the opening bracket. Maybe it just shows it's a bit fragile style...
>>
>>
>
> Yeah I know :) I know you won't believe this coming from me, but I was
> trying to minimise the churn :P
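
P.S. To make the read-lock/WARN_ON suggestion above a little more concrete,
here is a rough, untested sketch of the shape I have in mind. It assumes the
function lives in mm/pagewalk.c next to the existing internal walk_pgd_range()
helper; the PAGE_OFFSET-based range check and the WARN_ON_ONCE are purely
illustrative and not taken from this patch:

/*
 * Illustrative sketch only (as if in mm/pagewalk.c, which already has the
 * needed includes and the static walk_pgd_range() helper).
 */
int walk_kernel_page_table_range(unsigned long start, unsigned long end,
				 const struct mm_walk_ops *ops,
				 pgd_t *pgd, void *private)
{
	struct mm_walk walk = {
		.ops		= ops,
		.mm		= &init_mm,
		.pgd		= pgd,
		.private	= private,
		.no_vma		= true,
	};
	int ret;

	/*
	 * Catch accidental misuse with a userspace range.  PAGE_OFFSET is
	 * only a stand-in boundary here; a real check may need to be
	 * arch-aware.
	 */
	if (WARN_ON_ONCE(start >= end || start < PAGE_OFFSET))
		return -EINVAL;

	/*
	 * Kernel page tables are not torn down while being walked, so per
	 * commit b123d09304d86 the mmap read lock on init_mm is enough to
	 * serialise against other walkers.
	 */
	mmap_read_lock(&init_mm);
	ret = walk_pgd_range(start, end, &walk);
	mmap_read_unlock(&init_mm);

	return ret;
}

With something like that, the callers in the diff above could then drop their
explicit mmap_write_lock()/mmap_write_unlock() pairs around the call.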