From: Jinjiang Tu <tujinjiang@huawei.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Yang Shi <yang@os.amperecomputing.com>,
	Ard Biesheuvel <ardb@kernel.org>, Dev Jain <dev.jain@arm.com>,
	<scott@os.amperecomputing.com>, <cl@gentwo.org>
Cc: <linux-arm-kernel@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH v7 1/6] arm64: Enable permission change on arm64 kernel block mappings
Date: Thu, 4 Sep 2025 11:40:28 +0800
Message-ID: <7705c29b-4f08-4b56-aab3-024795ee9124@huawei.com>
In-Reply-To: <20250829115250.2395585-2-ryan.roberts@arm.com>



On 2025/8/29 19:52, Ryan Roberts wrote:
> From: Dev Jain <dev.jain@arm.com>
>
> This patch paves the way to enable huge mappings in vmalloc space and
> linear map space by default on arm64. For this we must ensure that we
> can handle any permission games on the kernel (init_mm) pagetable.
> Previously, __change_memory_common() used apply_to_page_range() which
> does not support changing permissions for block mappings. We move away
> from this by using the pagewalk API, similar to what riscv does right
> now. It is the responsibility of the caller to ensure that the range
> over which permissions are being changed falls on leaf mapping
> boundaries. For systems with BBML2, this will be handled in future
> patches by dynamically splitting the mappings when required.
>
> Unlike apply_to_page_range(), the pagewalk API currently requires
> init_mm.mmap_lock to be held. To avoid making the mmap_lock an
> unnecessary bottleneck for our use case, this patch extends this
> generic API so that it can be used locklessly, retaining the existing
> behaviour for changing permissions. In addition, it is noted at [1]
> that KFENCE can manipulate kernel pgtable entries during softirqs, by
> calling set_memory_valid() -> __change_memory_common(). As that is a
> non-sleepable context, we cannot take the init_mm mmap lock.
>
> Add comments to highlight the conditions under which we can use the
> lockless variant - no underlying VMA, and the user having exclusive
> control over the range, thus guaranteeing no concurrent access.
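
For illustration, a minimal sketch of a caller that satisfies these
conditions ("buf" is a hypothetical vmalloc allocation, not from the patch;
only its owner maps and touches it, and no VMA is involved):

	void *buf = vmalloc(SZ_2M);

	/* buf is exclusively ours and has no userspace VMA, so the
	 * lockless walk underneath __change_memory_common() is safe: */
	set_memory_ro((unsigned long)buf, SZ_2M / PAGE_SIZE);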
>
> We require that the start and end of a given range do not partially
> overlap block mappings or cont mappings. Return -EINVAL in case a
> partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
> add a corresponding comment in update_range_prot() to warn that
> eliminating such a condition is the responsibility of the caller.
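
As a worked example (hypothetical addresses, not from the patch): suppose
the range below starts 4K into a 2M PMD block mapping. The walk then reaches
pageattr_pmd_entry() with (next - addr) != PMD_SIZE and fails:

	/* block_base is a hypothetical 2M-aligned, PMD-block-mapped address */
	ret = update_range_prot(block_base + SZ_4K, SZ_4K,
				__pgprot(PTE_RDONLY), __pgprot(0));
	/* ret == -EINVAL: the caller must split the mapping first */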
>
> Note that the pte level callback may change permissions for a whole
> contpte block, and that will be done one pte at a time, as opposed to an
> atomic operation for the block mappings. This is fine as any access will
> decode either the old or the new permission until the TLBI.
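
Concretely (a pseudo-view, assuming a 4K granule where a contpte block spans
CONT_PTES == 16 entries), the walker simply invokes the pte callback once per
entry:

	for (i = 0; i < CONT_PTES; i++)
		pageattr_pte_entry(ptep + i, addr + i * PAGE_SIZE,
				   addr + (i + 1) * PAGE_SIZE, walk);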
>
> apply_to_page_range() currently performs all pte level callbacks while
> in lazy mmu mode. Since arm64 can optimize performance by batching
> barriers when modifying kernel pgtables in lazy mmu mode, we would like
> to continue to benefit from this optimisation. Unfortunately,
> walk_kernel_page_table_range() does not use lazy mmu mode. However,
> since the pagewalk framework is not allocating any memory, we can safely
> bracket the whole operation inside lazy mmu mode ourselves. Therefore,
> wrap the call to walk_kernel_page_table_range() with the lazy MMU
> helpers.
>
> Link: https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/ [1]
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
>   arch/arm64/mm/pageattr.c | 153 +++++++++++++++++++++++++++++++--------
>   include/linux/pagewalk.h |   3 +
>   mm/pagewalk.c            |  36 ++++++---
>   3 files changed, 149 insertions(+), 43 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 04d4a8f676db..6da8cbc32f46 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
>   #include <linux/mem_encrypt.h>
>   #include <linux/sched.h>
>   #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>   
>   #include <asm/cacheflush.h>
>   #include <asm/pgtable-prot.h>
> @@ -20,6 +21,99 @@ struct page_change_data {
>   	pgprot_t clear_mask;
>   };
>   
> +static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
> +{
> +	struct page_change_data *masks = walk->private;
> +
> +	val &= ~(pgprot_val(masks->clear_mask));
> +	val |= (pgprot_val(masks->set_mask));
> +
> +	return val;
> +}
> +
> +static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pgd_t val = pgdp_get(pgd);
> +
> +	if (pgd_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
> +			return -EINVAL;
> +		val = __pgd(set_pageattr_masks(pgd_val(val), walk));
> +		set_pgd(pgd, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	p4d_t val = p4dp_get(p4d);
> +
> +	if (p4d_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
> +			return -EINVAL;
> +		val = __p4d(set_pageattr_masks(p4d_val(val), walk));
> +		set_p4d(p4d, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pud_t val = pudp_get(pud);
> +
> +	if (pud_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> +			return -EINVAL;
> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
> +		set_pud(pud, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pmd_t val = pmdp_get(pmd);
> +
> +	if (pmd_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> +			return -EINVAL;
> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> +		set_pmd(pmd, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}

#define pmd_leaf(pmd)        (pmd_present(pmd) && !pmd_table(pmd))
#define pmd_present(pmd)	pte_present(pmd_pte(pmd))
#define pte_present(pte)	(pte_valid(pte) || pte_present_invalid(pte))

When PTE_VALID has been cleared, pmd_leaf() will return false. As a result,
pageattr_pmd_entry() skips the entry and the caller won't be able to set
PTE_VALID for the pmd again.
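
To make the consequence concrete, a minimal sketch (assuming addr is covered
by a PMD block mapping in the linear map):

	set_memory_valid(addr, nr, 0);	/* clears PTE_VALID on the block */
	set_memory_valid(addr, nr, 1);	/* pmd_present() is now false, so
					 * pmd_leaf() fails and the pmd is
					 * never made valid again */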

> +
> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pte_t val = __ptep_get(pte);
> +
> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
> +	__set_pte(pte, val);
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops pageattr_ops = {
> +	.pgd_entry	= pageattr_pgd_entry,
> +	.p4d_entry	= pageattr_p4d_entry,
> +	.pud_entry	= pageattr_pud_entry,
> +	.pmd_entry	= pageattr_pmd_entry,
> +	.pte_entry	= pageattr_pte_entry,
> +};
> +
>   bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>   
>   bool can_set_direct_map(void)
> @@ -37,32 +131,35 @@ bool can_set_direct_map(void)
>   		arm64_kfence_can_set_direct_map() || is_realm_world();
>   }
>   
> -static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
> +static int update_range_prot(unsigned long start, unsigned long size,
> +			     pgprot_t set_mask, pgprot_t clear_mask)
>   {
> -	struct page_change_data *cdata = data;
> -	pte_t pte = __ptep_get(ptep);
> +	struct page_change_data data;
> +	int ret;
>   
> -	pte = clear_pte_bit(pte, cdata->clear_mask);
> -	pte = set_pte_bit(pte, cdata->set_mask);
> +	data.set_mask = set_mask;
> +	data.clear_mask = clear_mask;
>   
> -	__set_pte(ptep, pte);
> -	return 0;
> +	arch_enter_lazy_mmu_mode();
> +
> +	/*
> +	 * The caller must ensure that the range we are operating on does not
> +	 * partially overlap a block mapping, or a cont mapping. Any such case
> +	 * must be eliminated by splitting the mapping.
> +	 */
> +	ret = walk_kernel_page_table_range_lockless(start, start + size,
> +						    &pageattr_ops, NULL, &data);
> +	arch_leave_lazy_mmu_mode();
> +
> +	return ret;
>   }
>   
> -/*
> - * This function assumes that the range is mapped with PAGE_SIZE pages.
> - */
>   static int __change_memory_common(unsigned long start, unsigned long size,
> -				pgprot_t set_mask, pgprot_t clear_mask)
> +				  pgprot_t set_mask, pgprot_t clear_mask)
>   {
> -	struct page_change_data data;
>   	int ret;
>   
> -	data.set_mask = set_mask;
> -	data.clear_mask = clear_mask;
> -
> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> -					&data);
> +	ret = update_range_prot(start, size, set_mask, clear_mask);
>   
>   	/*
>   	 * If the memory is being made valid without changing any other bits
> @@ -174,32 +271,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>   
>   int set_direct_map_invalid_noflush(struct page *page)
>   {
> -	struct page_change_data data = {
> -		.set_mask = __pgprot(0),
> -		.clear_mask = __pgprot(PTE_VALID),
> -	};
> +	pgprot_t clear_mask = __pgprot(PTE_VALID);
> +	pgprot_t set_mask = __pgprot(0);
>   
>   	if (!can_set_direct_map())
>   		return 0;
>   
> -	return apply_to_page_range(&init_mm,
> -				   (unsigned long)page_address(page),
> -				   PAGE_SIZE, change_page_range, &data);
> +	return update_range_prot((unsigned long)page_address(page),
> +				 PAGE_SIZE, set_mask, clear_mask);
>   }
>   
>   int set_direct_map_default_noflush(struct page *page)
>   {
> -	struct page_change_data data = {
> -		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
> -		.clear_mask = __pgprot(PTE_RDONLY),
> -	};
> +	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
> +	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
>   
>   	if (!can_set_direct_map())
>   		return 0;
>   
> -	return apply_to_page_range(&init_mm,
> -				   (unsigned long)page_address(page),
> -				   PAGE_SIZE, change_page_range, &data);
> +	return update_range_prot((unsigned long)page_address(page),
> +				 PAGE_SIZE, set_mask, clear_mask);
>   }
>   
>   static int __set_memory_enc_dec(unsigned long addr,
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 682472c15495..88e18615dd72 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -134,6 +134,9 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   int walk_kernel_page_table_range(unsigned long start,
>   		unsigned long end, const struct mm_walk_ops *ops,
>   		pgd_t *pgd, void *private);
> +int walk_kernel_page_table_range_lockless(unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		pgd_t *pgd, void *private);
>   int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>   			unsigned long end, const struct mm_walk_ops *ops,
>   			void *private);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 648038247a8d..936689d8bcac 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -606,10 +606,32 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>   		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
>   {
> -	struct mm_struct *mm = &init_mm;
> +	/*
> +	 * Kernel intermediate page tables are usually not freed, so the mmap
> +	 * read lock is sufficient. But there are some exceptions, e.g. memory
> +	 * hot-remove, in which case the mmap lock is insufficient to prevent
> +	 * the intermediate kernel page tables belonging to the specified
> +	 * address range from being freed. The caller should take other
> +	 * actions to prevent this race.
> +	 */
> +	mmap_assert_locked(&init_mm);
> +
> +	return walk_kernel_page_table_range_lockless(start, end, ops, pgd,
> +						     private);
> +}
> +
> +/*
> + * Use this function to walk the kernel page tables locklessly. The caller
> + * must have exclusive access to the range being operated on, so that there
> + * can be no concurrent access - for example, when changing permissions for
> + * vmalloc objects that the caller owns.
> + */
> +int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
> +		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
> +{
>   	struct mm_walk walk = {
>   		.ops		= ops,
> -		.mm		= mm,
> +		.mm		= &init_mm,
>   		.pgd		= pgd,
>   		.private	= private,
>   		.no_vma		= true
> @@ -620,16 +642,6 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>   	if (!check_ops_valid(ops))
>   		return -EINVAL;
>   
> -	/*
> -	 * Kernel intermediate page tables are usually not freed, so the mmap
> -	 * read lock is sufficient. But there are some exceptions.
> -	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> -	 * to prevent the intermediate kernel pages tables belonging to the
> -	 * specified address range from being freed. The caller should take
> -	 * other actions to prevent this race.
> -	 */
> -	mmap_assert_locked(mm);
> -
>   	return walk_pgd_range(start, end, &walk);
>   }
>   



Thread overview: 53+ messages
2025-08-29 11:52 [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Ryan Roberts
2025-08-29 11:52 ` [PATCH v7 1/6] arm64: Enable permission change on arm64 kernel block mappings Ryan Roberts
2025-09-04  3:40   ` Jinjiang Tu [this message]
2025-09-04 11:06     ` Ryan Roberts
2025-09-04 11:49       ` Jinjiang Tu
2025-09-04 13:21         ` Ryan Roberts
2025-09-16 21:37       ` Yang Shi
2025-08-29 11:52 ` [PATCH v7 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list Ryan Roberts
2025-08-29 22:08   ` Yang Shi
2025-09-04 11:07     ` Ryan Roberts
2025-09-03 17:24   ` Catalin Marinas
2025-09-04  0:49     ` Yang Shi
2025-08-29 11:52 ` [PATCH v7 3/6] arm64: mm: support large block mapping when rodata=full Ryan Roberts
2025-09-03 19:15   ` Catalin Marinas
2025-09-04  0:52     ` Yang Shi
2025-09-04 11:09     ` Ryan Roberts
2025-09-04 11:15   ` Ryan Roberts
2025-09-04 14:57     ` Yang Shi
2025-08-29 11:52 ` [PATCH v7 4/6] arm64: mm: Optimize split_kernel_leaf_mapping() Ryan Roberts
2025-08-29 22:11   ` Yang Shi
2025-09-03 19:20   ` Catalin Marinas
2025-09-04 11:09     ` Ryan Roberts
2025-08-29 11:52 ` [PATCH v7 5/6] arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs Ryan Roberts
2025-09-04 16:59   ` Catalin Marinas
2025-09-04 17:54     ` Yang Shi
2025-09-08 15:25     ` Ryan Roberts
2025-08-29 11:52 ` [PATCH v7 6/6] arm64: mm: Optimize linear_map_split_to_ptes() Ryan Roberts
2025-08-29 22:27   ` Yang Shi
2025-09-04 11:10     ` Ryan Roberts
2025-09-04 14:58       ` Yang Shi
2025-09-04 17:00   ` Catalin Marinas
2025-09-01  5:04 ` [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Dev Jain
2025-09-01  8:03   ` Ryan Roberts
2025-09-03  0:21     ` Yang Shi
2025-09-03  0:50       ` Yang Shi
2025-09-04 13:14         ` Ryan Roberts
2025-09-04 13:16           ` Ryan Roberts
2025-09-04 17:47             ` Yang Shi
2025-09-04 21:49               ` Yang Shi
2025-09-08 16:34                 ` Ryan Roberts
2025-09-08 18:31                   ` Yang Shi
2025-09-09 14:36                     ` Ryan Roberts
2025-09-09 15:32                       ` Yang Shi
2025-09-09 16:32                         ` Ryan Roberts
2025-09-09 17:32                           ` Yang Shi
2025-09-11 22:03                             ` Yang Shi
2025-09-17 16:28                               ` Ryan Roberts
2025-09-17 17:21                                 ` Yang Shi
2025-09-17 18:58                                   ` Ryan Roberts
2025-09-17 19:15                                     ` Yang Shi
2025-09-17 19:40                                       ` Ryan Roberts
2025-09-17 19:59                                         ` Yang Shi
2025-09-16 23:44               ` Yang Shi
