linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	catalin.marinas@arm.com, will@kernel.org,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, suzuki.poulose@arm.com,
	steven.price@arm.com, gshan@redhat.com,
	linux-arm-kernel@lists.infradead.org,
	Jann Horn <jannh@google.com>,
	Yang Shi <yang@os.amperecomputing.com>,
	Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH 1/3] mm: Allow pagewalk without locks
Date: Fri, 6 Jun 2025 14:51:48 +0530	[thread overview]
Message-ID: <ecfed817-105d-487f-80ba-52965f926c1e@arm.com> (raw)
In-Reply-To: <6a60c052-9935-489e-a38e-1b03a1a79155@lucifer.local>


On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
> +cc Jan for page table stuff.
>
> On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
>> locks.
> Hm This is worrying.
>
> You're unconditionally making it possible for dangerous usage here - to
> walk page tables without a lock.
>
> We need to assert this is only being used in a context where this makes
> sense, e.g. a no VMA range under the right circumstances.
>
> At the very least we need asserts that we are in a circumstance where this
> is permitted.
>
> For VMAs, you must keep the VMA stable, which requires a VMA read lock at
> minimum.
>
> See
> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
> for details where these requirements are documented.
>
> I also think we should update this documentation to cover off this non-VMA
> task context stuff. I can perhaps do this so as not to egregiously add
> workload to this series :)
>
> Also, again this commit message is not enough for such a major change to
> core mm stuff. I think you need to underline that - in non-task context -
> you are safe to manipulate _kernel_ mappings, having precluded KFENCE as a
> concern.

Sorry for the late reply. After your comments I had to go and properly understand
kernel pagetable walking, by reading your process_addrs documentation and the
code, so that I could prepare an answer and improve my understanding. Thanks
for your review!

How does the below comment above PGWALK_NOLOCK look?

"Walk without any lock. Use of this is only meant for the
  case where there is no underlying VMA, and the user has
  exclusive control over the range, guaranteeing no concurrent
  access. For example, changing permissions of vmalloc objects."

and the patch description can be modified as
"
It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
This being a non-sleepable context, we cannot take the init_mm mmap lock.
Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
locks.
Currently, __change_memory_common() uses apply_to_page_range() to change
permissions over a range of vmalloc space, without any locking.
Patch 2 of this series switches it to walk_page_range_novma(), hence
this patch is needed. We do not need any locks because the vmalloc object
has exclusive access to the range, i.e. two vmalloc objects do not share
the same physical address.
"


>
>> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
> Basically expand upon this information.
>
> Basically the commit message refers to your usage, but describes a patch
> that makes it possible to do unlocked page table walks.
>
> As I get into below, no pun intended, but this needs to be _locked down_
> heavily.
>
> - Only walk_page_range_novma() should allow it. All other functions should
>    return -EINVAL if this is set.

Sure.

>
> - walk_page_range_novma() should assert we're in the appropriate context
>    where this is feasible.

There should be two conditions: that the mm is init_mm, and that the start address
belongs to vmalloc (or module) space. I am a little nervous about the second:
searching through the codebase, I could find only vmalloc and module addresses
being modified through the set_memory_* API, but I couldn't prove that all such
usages are done on vmalloc/module addresses.

>
> - Comments should be updated accordingly.
>
> - We should assert (at least CONFIG_DEBUG_VM asserts) in every place that
>    checks for a VMA that we are not in this lock mode, since this is
>    disallowed.
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pagewalk.h |  2 ++
>>   mm/pagewalk.c            | 12 ++++++++----
>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 9700a29f8afb..9bc8853ed3de 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>   	PGWALK_WRLOCK = 1,
>>   	/* vma is expected to be already write-locked during the walk */
>>   	PGWALK_WRLOCK_VERIFY = 2,
>> +	/* no lock is needed */
>> +	PGWALK_NOLOCK = 3,
> I'd prefer something very explicitly documenting that, at the very least, this
> can only be used for non-VMA cases.
>
> It's hard to think of a name here, but the comment should be explicit as to
> under what circumstances this is allowed.
>
>>   };
>>
>>   /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index e478777c86e1..9657cf4664b2 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>   	case PGWALK_RDLOCK:
>>   		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>   		break;
>> +	default:
>> +		break;
> Please no 'default' here, we want to be explicit and cover all cases.

Sure.

>
> And surely, since you're explicitly only allowing this for non-VMA ranges, this
> should be a WARN_ON_ONCE() or something?

Sounds good, maybe a WARN_ON_ONCE(vma)?

>
>>   	}
>>   #endif
>>   }
>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>   	 * specified address range from being freed. The caller should take
>>   	 * other actions to prevent this race.
>>   	 */
> All functions other than this should explicitly disallow this locking mode
> with -EINVAL checks. I do not want to see this locking mode made available
> in a broken context.
>
> The full comment:
>
> 	/*
> 	 * 1) For walking the user virtual address space:
> 	 *
> 	 * The mmap lock protects the page walker from changes to the page
> 	 * tables during the walk.  However a read lock is insufficient to
> 	 * protect those areas which don't have a VMA as munmap() detaches
> 	 * the VMAs before downgrading to a read lock and actually tearing
> 	 * down PTEs/page tables. In which case, the mmap write lock should
> 	 * be hold.
> 	 *
> 	 * 2) For walking the kernel virtual address space:
> 	 *
> 	 * The kernel intermediate page tables usually do not be freed, so
> 	 * the mmap map read lock is sufficient. But there are some exceptions.
> 	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> 	 * to prevent the intermediate kernel pages tables belonging to the
> 	 * specified address range from being freed. The caller should take
> 	 * other actions to prevent this race.
> 	 */
>
> Are you walking kernel memory only? Point 1 above explicitly points out why
> userland novma memory requires a lock.
>
> For point 2 you need to indicate why you don't need to consider hotplugging,

Well, hot-unplugging will first offline the physical memory, and since the
vmalloc object holds references to the pages, there is no race.

> etc.
>
> But as Ryan points out elsewhere, you should be expanding this comment to
> explain your case...
>
> You should also assert you're in a context where this applies and error
> out/WARN if not.
>
>> -	if (mm == &init_mm)
>> -		mmap_assert_locked(walk.mm);
>> -	else
>> -		mmap_assert_write_locked(walk.mm);
>> +	if (ops->walk_lock != PGWALK_NOLOCK) {
> I really don't like the idea that you're allowing no lock for userland mappings.
>
> This should at the very least be:
>
> if (mm == &init_mm)  {
> 	if (ops->walk_lock != PGWALK_NOLOCK)
> 		mmap_assert_locked(walk.mm);
> } else {
> 	mmap_assert_write_locked(walk.mm);
> }

Sure.

>
>> +		if (mm == &init_mm)
>> +			mmap_assert_locked(walk.mm);
>> +		else
>> +			mmap_assert_write_locked(walk.mm);
>> +	}
>>
>>   	return walk_pgd_range(start, end, &walk);
>>   }
>> --
>> 2.30.2
>>
> We have to be _really_ careful with this stuff. It's very fiddly and
> brokenness can be a security issue.



Thread overview: 29+ messages
2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
2025-05-30 10:33   ` Ryan Roberts
2025-05-30 21:33     ` Yang Shi
2025-05-30 10:57   ` Lorenzo Stoakes
2025-06-06  9:21     ` Dev Jain [this message]
2025-06-06  9:33       ` Dev Jain
2025-06-06 10:02       ` Lorenzo Stoakes
2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
2025-05-30 12:53   ` Ryan Roberts
2025-06-02  4:35     ` Dev Jain
2025-06-06  9:49   ` Lorenzo Stoakes
2025-06-06 10:39     ` Dev Jain
2025-06-06 10:56       ` Lorenzo Stoakes
2025-06-06 11:08         ` Dev Jain
2025-06-09  9:41     ` Dev Jain
2025-06-09 11:00       ` Lorenzo Stoakes
2025-06-09 11:31         ` Dev Jain
2025-05-30  9:04 ` [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64 Dev Jain
2025-05-30 11:14   ` Lorenzo Stoakes
2025-05-30 12:12     ` Ryan Roberts
2025-05-30 12:18       ` Lorenzo Stoakes
2025-05-30  9:24 ` [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
2025-05-30 10:03 ` Ryan Roberts
2025-05-30 10:10   ` Dev Jain
2025-05-30 10:37     ` Ryan Roberts
2025-05-30 10:42       ` Dev Jain
2025-05-30 10:51         ` Ryan Roberts
2025-05-30 11:11           ` Dev Jain
