* [PATCH v2 0/2] Avoid memory corruption caused by per-VMA locks
@ 2023-07-05 6:37 Suren Baghdasaryan
2023-07-05 6:37 ` [PATCH v2 1/2] fork: lock VMAs of the parent process when forking Suren Baghdasaryan
2023-07-05 6:37 ` [PATCH v2 2/2] mm: disable CONFIG_PER_VMA_LOCK until it's fixed Suren Baghdasaryan
0 siblings, 2 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2023-07-05 6:37 UTC (permalink / raw)
To: akpm
Cc: jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, david, dhowells, hughd, bigeasy, kent.overstreet,
punit.agrawal, lstoakes, peterjung1337, rientjes, chriscli,
axelrasmussen, joelaf, minchan, rppt, jannh, shakeelb, tatashin,
edumazet, gthelen, linux-mm, linux-kernel, stable,
Suren Baghdasaryan
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking
while forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork.
I tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
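For reference, a stress test of that shape can be approximated with a small program
along the following lines. This is only a sketch, not the program used for the numbers
above; NR_VMAS/NR_FORKS and the alternating-protection trick (used so adjacent anonymous
mappings do not merge into a single VMA) are assumptions made here for illustration:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NR_VMAS  10000
#define NR_FORKS 5000

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec start, end;

	/* Alternate protections so adjacent mappings do not merge into one VMA. */
	for (int i = 0; i < NR_VMAS; i++) {
		int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;

		if (mmap(NULL, page, prot, MAP_PRIVATE | MAP_ANONYMOUS,
			 -1, 0) == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0)
			_exit(0);	/* child exits immediately */
		waitpid(pid, NULL, 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%d forks over %d VMAs took %.3f s\n", NR_FORKS, NR_VMAS,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}

Because the children exit immediately and no pages are populated, almost all of the
measured time is dup_mmap() itself, which makes the cost of the added per-VMA locking
most visible.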
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
Both patches apply cleanly over Linus' ToT and the stable 6.4.y branch.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Suren Baghdasaryan (2):
fork: lock VMAs of the parent process when forking
mm: disable CONFIG_PER_VMA_LOCK until it's fixed
kernel/fork.c | 1 +
mm/Kconfig | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
--
2.41.0.255.g8b1d071c50-goog
* [PATCH v2 1/2] fork: lock VMAs of the parent process when forking
2023-07-05 6:37 [PATCH v2 0/2] Avoid memory corruption caused by per-VMA locks Suren Baghdasaryan
@ 2023-07-05 6:37 ` Suren Baghdasaryan
2023-07-05 8:08 ` David Hildenbrand
2023-07-05 6:37 ` [PATCH v2 2/2] mm: disable CONFIG_PER_VMA_LOCK until it's fixed Suren Baghdasaryan
1 sibling, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2023-07-05 6:37 UTC (permalink / raw)
To: akpm
Cc: jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, david, dhowells, hughd, bigeasy, kent.overstreet,
punit.agrawal, lstoakes, peterjung1337, rientjes, chriscli,
axelrasmussen, joelaf, minchan, rppt, jannh, shakeelb, tatashin,
edumazet, gthelen, linux-mm, linux-kernel, stable,
Suren Baghdasaryan
When forking a child process, the parent write-protects an anonymous page
and COW-shares it with the child being forked using copy_present_pte().
The parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because it is COW-shared with the child), this might lead to
some stale writable TLB entries targeting the wrong (old) page.
A similar issue happened in the past with userfaultfd (see the
flush_tlb_page() call inside do_wp_page()).
Lock the VMAs of the parent process when forking a child, which prevents
concurrent page faults during the fork operation and avoids this issue.
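To make the window concrete, below is a minimal, hypothetical sketch of the kind of
multi-threaded reproducer the reports describe (not the actual program from the
bugzilla entry; NR_PAGES, NR_FORKS and the read-back check are illustrative
assumptions): a writer thread keeps bumping a per-page counter while the main thread
forks in a tight loop, and a counter that appears to go backwards indicates a write
that landed in the stale, pre-COW page.

/* Build with: cc -O2 -pthread repro.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_PAGES 512
#define NR_FORKS 100000

static unsigned char *buf;
static long page_size;

static void *writer(void *arg)
{
	(void)arg;
	for (unsigned long v = 1; ; v++) {
		for (int i = 0; i < NR_PAGES; i++) {
			volatile unsigned long *slot =
				(volatile unsigned long *)(buf + i * page_size);

			/* The previous pass stored v - 1 here; an older value
			 * means a write was lost to the old page. */
			if (v > 1 && *slot != v - 1) {
				fprintf(stderr,
					"corruption: page %d holds %lu, expected %lu\n",
					i, *slot, v - 1);
				abort();
			}
			*slot = v;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	page_size = sysconf(_SC_PAGESIZE);
	buf = mmap(NULL, NR_PAGES * page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	pthread_create(&tid, NULL, writer, NULL);

	for (int i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* child exits immediately */
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	printf("no corruption observed\n");
	return 0;
}

The writer faults and writes through the per-VMA lock path while the main thread is
inside dup_mmap(); with the parent's VMAs write-locked during fork, those faults fall
back to mmap_lock and wait, which closes the stale-TLB window described above.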
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
 
+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
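For context, here is a heavily trimmed sketch of the fault-side path this pairs with,
loosely modeled on the x86 per-VMA fault handling as of 6.4 (illustrative only, not a
verbatim copy of arch/x86/mm/fault.c): the lockless path only proceeds if it can
read-lock the VMA, so a VMA that dup_mmap() has write-locked pushes the fault onto the
mmap_lock path, and mmap_lock is held for write by dup_mmap() until after the TLB flush.

	/* Illustrative shape of the per-VMA (lockless) fault path. */
	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;	/* VMA is write-locked, e.g. by dup_mmap() */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);

	if (!(fault & VM_FAULT_RETRY))
		goto done;

lock_mmap:
	/*
	 * Fall back to taking mmap_lock for read, which waits for dup_mmap()
	 * to flush the TLB and drop mmap_lock, so do_wp_page() can no longer
	 * race with stale writable TLB entries from before the fork.
	 */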
--
2.41.0.255.g8b1d071c50-goog
* [PATCH v2 2/2] mm: disable CONFIG_PER_VMA_LOCK until it's fixed
2023-07-05 6:37 [PATCH v2 0/2] Avoid memory corruption caused by per-VMA locks Suren Baghdasaryan
2023-07-05 6:37 ` [PATCH v2 1/2] fork: lock VMAs of the parent process when forking Suren Baghdasaryan
@ 2023-07-05 6:37 ` Suren Baghdasaryan
1 sibling, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2023-07-05 6:37 UTC (permalink / raw)
To: akpm
Cc: jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, david, dhowells, hughd, bigeasy, kent.overstreet,
punit.agrawal, lstoakes, peterjung1337, rientjes, chriscli,
axelrasmussen, joelaf, minchan, rppt, jannh, shakeelb, tatashin,
edumazet, gthelen, linux-mm, linux-kernel, stable,
Suren Baghdasaryan
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86.
Disable the per-VMA locks config to prevent this issue while the problem is
being investigated. This is expected to be a temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..0abc6c71dd89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
 	def_bool n
 
 config PER_VMA_LOCK
-	def_bool y
+	bool "Enable per-vma locking during page fault handling."
 	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+	depends on BROKEN
 	help
 	  Allow per-vma locking during page fault handling.
 
--
2.41.0.255.g8b1d071c50-goog
* Re: [PATCH v2 1/2] fork: lock VMAs of the parent process when forking
2023-07-05 6:37 ` [PATCH v2 1/2] fork: lock VMAs of the parent process when forking Suren Baghdasaryan
@ 2023-07-05 8:08 ` David Hildenbrand
2023-07-05 16:10 ` Suren Baghdasaryan
0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2023-07-05 8:08 UTC (permalink / raw)
To: Suren Baghdasaryan, akpm
Cc: jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
lstoakes, peterjung1337, rientjes, chriscli, axelrasmussen,
joelaf, minchan, rppt, jannh, shakeelb, tatashin, edumazet,
gthelen, linux-mm, linux-kernel, stable
On 05.07.23 08:37, Suren Baghdasaryan wrote:
> When forking a child process, parent write-protects an anonymous page
> and COW-shares it with the child being forked using copy_present_pte().
> Parent's TLB is flushed right before we drop the parent's mmap_lock in
> dup_mmap(). If we get a write-fault before that TLB flush in the parent,
> and we end up replacing that anonymous page in the parent process in
> do_wp_page() (because, COW-shared with the child), this might lead to
> some stale writable TLB entries targeting the wrong (old) page.
> Similar issue happened in the past with userfaultfd (see flush_tlb_page()
> call inside do_wp_page()).
> Lock VMAs of the parent process when forking a child, which prevents
> concurrent page faults during fork operation and avoids this issue.
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show noticeable regression on a 56-core machine while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> shows ~5% regression. If such fork time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> optimizations are possible if this regression proves to be problematic.
Out of interest, did you also populate page tables / pages for some of these
VMAs, or is this primarily looping over 10000 VMAs that don't actually copy any
page tables?
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Cc: stable@vger.kernel.org
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> kernel/fork.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b85814e614a5..d2e12b6d2b18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> for_each_vma(old_vmi, mpnt) {
> struct file *file;
>
> + vma_start_write(mpnt);
> if (mpnt->vm_flags & VM_DONTCOPY) {
> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> continue;
After the mmap_write_lock_killable(), there will still be a period where page
faults can happen. Essentially, page faults can happen for a VMA until we lock that VMA.
I cannot immediately name something that would break by allowing that, and this change
should fix the issue at hand, but exotic things like
	flush_cache_dup_mm(oldmm);
make me wonder whether we really want to allow it, or whether there is some other corner
case in fork() handling that really doesn't expect concurrent page faults (and, thereby,
page table modifications) during fork.
For example, Documentation/core-api/cachetlb.rst says
2) ``void flush_cache_dup_mm(struct mm_struct *mm)``
This interface flushes an entire user address space from
the caches. That is, after running, there will be no cache
lines associated with 'mm'.
This interface is used to handle whole address space
page table operations such as what happens during fork.
This option is separate from flush_cache_mm to allow some
optimizations for VIPT caches.
An alternative that requires another VMA walk would be
diff --git a/kernel/fork.c b/kernel/fork.c
index 41c964104b58..0f182d3f049b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -662,6 +662,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		retval = -EINTR;
 		goto fail_uprobe_end;
 	}
+
+	/* Disallow any page faults early by locking all VMAs. */
+	if (IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
+		for_each_vma(old_vmi, mpnt)
+			vma_start_write(mpnt);
+		vma_iter_init(&old_vmi, oldmm, 0);
+	}
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
 	/*
--
2.41.0
Unless there are other thoughts, I guess your change is fine regarding the problem
at hand. I'm not so sure regarding any other corner cases; that's why I'm spelling it out.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 1/2] fork: lock VMAs of the parent process when forking
2023-07-05 8:08 ` David Hildenbrand
@ 2023-07-05 16:10 ` Suren Baghdasaryan
2023-07-05 17:14 ` Suren Baghdasaryan
0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2023-07-05 16:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
lstoakes, peterjung1337, rientjes, chriscli, axelrasmussen,
joelaf, minchan, rppt, jannh, shakeelb, tatashin, edumazet,
gthelen, linux-mm, linux-kernel, stable
On Wed, Jul 5, 2023 at 1:08 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.07.23 08:37, Suren Baghdasaryan wrote:
> > When forking a child process, parent write-protects an anonymous page
> > and COW-shares it with the child being forked using copy_present_pte().
> > Parent's TLB is flushed right before we drop the parent's mmap_lock in
> > dup_mmap(). If we get a write-fault before that TLB flush in the parent,
> > and we end up replacing that anonymous page in the parent process in
> > do_wp_page() (because, COW-shared with the child), this might lead to
> > some stale writable TLB entries targeting the wrong (old) page.
> > Similar issue happened in the past with userfaultfd (see flush_tlb_page()
> > call inside do_wp_page()).
> > Lock VMAs of the parent process when forking a child, which prevents
> > concurrent page faults during fork operation and avoids this issue.
> > This fix can potentially regress some fork-heavy workloads. Kernel build
> > time did not show noticeable regression on a 56-core machine while a
> > stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> > shows ~5% regression. If such fork time regression is unacceptable,
> > disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> > optimizations are possible if this regression proves to be problematic.
>
> Out of interest, did you also populate page tables / pages for some of these
> VMAs, or is this primarily looping over 10000 VMAs that don't actually copy any
> page tables?
I did not populate the page tables, so this represents the worst-case
scenario (the share of time spent locking the VMAs is maximized).
>
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Reported-by: Jiri Slaby <jirislaby@kernel.org>
> > Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> > Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> > Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
> > Reported-by: Jacob Young <jacobly.alt@gmail.com>
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
> > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > kernel/fork.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b85814e614a5..d2e12b6d2b18 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > for_each_vma(old_vmi, mpnt) {
> > struct file *file;
> >
> > + vma_start_write(mpnt);
> > if (mpnt->vm_flags & VM_DONTCOPY) {
> > vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > continue;
>
> After the mmap_write_lock_killable(), there will still be a period where page
> faults can happen. Essentially, page faults can happen for a VMA until we lock that VMA.
>
> I cannot immediately name something that is broken allowing for that, and this change
> should fix the issue at hand, but exotic things like
>
> flush_cache_dup_mm(oldmm);
>
> make me wonder if we really want to allow for that or if there is some other corner case
> in fork() handling that really doesn't expect concurrent page faults (and, thereby, page
> table modifications) with fork.
>
> For example, documentation/core-api/cachetlb.rst says
>
> 2) ``void flush_cache_dup_mm(struct mm_struct *mm)``
>
> This interface flushes an entire user address space from
> the caches. That is, after running, there will be no cache
> lines associated with 'mm'.
>
> This interface is used to handle whole address space
> page table operations such as what happens during fork.
>
> This option is separate from flush_cache_mm to allow some
> optimizations for VIPT caches.
>
I see. So, we really need to lock all VMAs before
flush_cache_dup_mm(). Makes sense. I'll post an update to this patch
shortly.
Thanks,
Suren.
>
> An alternative that requires another VMA walk would be
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 41c964104b58..0f182d3f049b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -662,6 +662,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> retval = -EINTR;
> goto fail_uprobe_end;
> }
> +
> + /* Disallow any page faults early by locking all VMAs. */
> + if (IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> + for_each_vma(old_vmi, mpnt)
> + vma_start_write(mpnt);
> + vma_iter_init(&old_vmi, oldmm, 0);
> + }
> flush_cache_dup_mm(oldmm);
> uprobe_dup_mmap(oldmm, mm);
> /*
> --
> 2.41.0
>
>
> Unless there are other thoughts, I guess your change is fine regarding the problem
> at hand. I'm not so sure regarding any other corner cases; that's why I'm spelling it out.
>
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [PATCH v2 1/2] fork: lock VMAs of the parent process when forking
2023-07-05 16:10 ` Suren Baghdasaryan
@ 2023-07-05 17:14 ` Suren Baghdasaryan
0 siblings, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2023-07-05 17:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, jirislaby, jacobly.alt, holger, hdegoede, michel, jglisse,
mhocko, vbabka, hannes, mgorman, dave, willy, liam.howlett,
peterz, ldufour, paulmck, mingo, will, luto, songliubraving,
peterx, dhowells, hughd, bigeasy, kent.overstreet, punit.agrawal,
lstoakes, peterjung1337, rientjes, chriscli, axelrasmussen,
joelaf, minchan, rppt, jannh, shakeelb, tatashin, edumazet,
gthelen, linux-mm, linux-kernel, stable
On Wed, Jul 5, 2023 at 9:10 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jul 5, 2023 at 1:08 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 05.07.23 08:37, Suren Baghdasaryan wrote:
> > > When forking a child process, parent write-protects an anonymous page
> > > and COW-shares it with the child being forked using copy_present_pte().
> > > Parent's TLB is flushed right before we drop the parent's mmap_lock in
> > > dup_mmap(). If we get a write-fault before that TLB flush in the parent,
> > > and we end up replacing that anonymous page in the parent process in
> > > do_wp_page() (because, COW-shared with the child), this might lead to
> > > some stale writable TLB entries targeting the wrong (old) page.
> > > Similar issue happened in the past with userfaultfd (see flush_tlb_page()
> > > call inside do_wp_page()).
> > > Lock VMAs of the parent process when forking a child, which prevents
> > > concurrent page faults during fork operation and avoids this issue.
> > > This fix can potentially regress some fork-heavy workloads. Kernel build
> > > time did not show noticeable regression on a 56-core machine while a
> > > stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> > > shows ~5% regression. If such fork time regression is unacceptable,
> > > disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> > > optimizations are possible if this regression proves to be problematic.
> >
> > Out of interest, did you also populate page tables / pages for some of these
> > VMAs, or is this primarily looping over 10000 VMAs that don't actually copy any
> > page tables?
>
> I did not populate the page tables, therefore this represents the
> worst case scenario (the share of time used to lock the VMAs is
> maximized).
>
> >
> > >
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > Reported-by: Jiri Slaby <jirislaby@kernel.org>
> > > Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> > > Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> > > Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
> > > Reported-by: Jacob Young <jacobly.alt@gmail.com>
> > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
> > > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > kernel/fork.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index b85814e614a5..d2e12b6d2b18 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > for_each_vma(old_vmi, mpnt) {
> > > struct file *file;
> > >
> > > + vma_start_write(mpnt);
> > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > > vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > continue;
> >
> > After the mmap_write_lock_killable(), there will still be a period where page
> > faults can happen. Essentially, page faults can happen for a VMA until we lock that VMA.
> >
> > I cannot immediately name something that is broken allowing for that, and this change
> > should fix the issue at hand, but exotic things like
> >
> > flush_cache_dup_mm(oldmm);
> >
> > make me wonder if we really want to allow for that or if there is some other corner case
> > in fork() handling that really doesn't expect concurrent page faults (and, thereby, page
> > table modifications) with fork.
> >
> > For example, documentation/core-api/cachetlb.rst says
> >
> > 2) ``void flush_cache_dup_mm(struct mm_struct *mm)``
> >
> > This interface flushes an entire user address space from
> > the caches. That is, after running, there will be no cache
> > lines associated with 'mm'.
> >
> > This interface is used to handle whole address space
> > page table operations such as what happens during fork.
> >
> > This option is separate from flush_cache_mm to allow some
> > optimizations for VIPT caches.
> >
>
> I see. So, we really need to lock all VMAs before
> flush_cache_dup_mm(). Makes sense. I'll post an update to this patch
> shortly.
v3 of the patchset with this fix is posted at
https://lore.kernel.org/all/20230705171213.2843068-1-surenb@google.com/
> Thanks,
> Suren.
>
> >
> > An alternative that requires another VMA walk would be
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 41c964104b58..0f182d3f049b 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -662,6 +662,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > retval = -EINTR;
> > goto fail_uprobe_end;
> > }
> > +
> > + /* Disallow any page faults early by locking all VMAs. */
> > + if (IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> > + for_each_vma(old_vmi, mpnt)
> > + vma_start_write(mpnt);
> > + vma_iter_init(&old_vmi, oldmm, 0);
> > + }
> > flush_cache_dup_mm(oldmm);
> > uprobe_dup_mmap(oldmm, mm);
> > /*
> > --
> > 2.41.0
> >
> >
> > Unless there are other thoughts, I guess your change is fine regarding the problem
> > at hand. I'm not so sure regarding any other corner cases; that's why I'm spelling it out.
> >
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >