[PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
@ 2025-04-22 14:49 David Hildenbrand
  2025-04-22 14:54 ` David Hildenbrand
  2025-04-23 14:41 ` Lorenzo Stoakes
  0 siblings, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-04-22 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, David Hildenbrand, Andrew Morton, Lorenzo Stoakes,
	Ingo Molnar, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, Rik van Riel, H. Peter Anvin,
	Linus Torvalds

Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
used for duplicating VMAs during fork(), but also for duplicating VMAs
when splitting VMAs or when mremap()'ing them.

VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.

In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation. So the
untrack_pfn_clear() call on the new VMA duplicating is wrong in that
context.

Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA. Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.

So when unmapping B, the reservation would be updated to cover only A. When
unmapping A, we would properly remove the now-shrunk reservation. That
scenario describes the mremap() shrinking (old_size > new_size), where
we split + unmap B, and the untrack_pfn_clear() on the new VMA when
is wrong.

What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
first? It would be broken because we would never free the reservation.
Likely, there are ways to trigger such a VMA split outside of mremap().

Affecting other VMA duplication was not intended, vm_area_dup() being
used outside of kernel/fork.c was an oversight. So let's fix that for;
how to handle VMA splits better should be investigated separately.

This was found by code inspection only, while staring at yet another
VM_PAT problem.

Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

This VM_PAT code really wants me to scream at my computer. So far it didn't
succeed, but I am close. Well, at least now I understand how it interacts
with VMA splitting ...

---
 kernel/fork.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8..168681fc4b25a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -498,10 +498,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	vma_numab_state_init(new);
 	dup_anon_vma_name(orig, new);

-	/* track_pfn_copy() will later take care of copying internal state. */
-	if (unlikely(new->vm_flags & VM_PFNMAP))
-		untrack_pfn_clear(new);
-
 	return new;
 }

@@ -672,6 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		tmp = vm_area_dup(mpnt);
 		if (!tmp)
 			goto fail_nomem;
+
+		/* track_pfn_copy() will later take care of copying internal state. */
+		if (unlikely(tmp->vm_flags & VM_PFNMAP))
+			untrack_pfn_clear(tmp);
+
 		retval = vma_dup_policy(mpnt, tmp);
 		if (retval)
 			goto fail_nomem_policy;
-- 
2.49.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-22 14:49 [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() David Hildenbrand
@ 2025-04-22 14:54 ` David Hildenbrand
  2025-04-23 14:42   ` Lorenzo Stoakes
  2025-04-23 14:41 ` Lorenzo Stoakes
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-04-22 14:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, Andrew Morton, Lorenzo Stoakes, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On 22.04.25 16:49, David Hildenbrand wrote:
> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> used for duplicating VMAs during fork(), but also for duplicating VMAs
> when splitting VMAs or when mremap()'ing them.
> 
> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> size) and apparently also shrunk during mremap(), which implies
> duplicating the VMA in __split_vma() first.
> 
> In case of ordinary mremap() (no change in size), we first duplicate the
> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> the old VMA: we effectively move the VM_PAT reservation. So the
> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> context.
> 
> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> reservation when splitting the VMA. Instead, in memtype_erase() -- called
> during zapping/munmap -- we shrink a reservation in case only the end
> address matches: Assume we split a VMA into A and B, both would share a
> reservation until B is unmapped.
> 
> So when unmapping B, the reservation would be updated to cover only A. When
> unmapping A, we would properly remove the now-shrunk reservation. That
> scenario describes the mremap() shrinking (old_size > new_size), where
> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> is wrong.
> 
> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> first? It would be broken because we would never free the reservation.
> Likely, there are ways to trigger such a VMA split outside of mremap().

As expected ... with a simple reproducer that uses mprotect() to split 
such a VMA I can trigger

x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 
0x00000000-0x00000fff]


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-22 14:49 [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() David Hildenbrand
  2025-04-22 14:54 ` David Hildenbrand
@ 2025-04-23 14:41 ` Lorenzo Stoakes
  2025-04-23 19:30   ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-04-23 14:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

+cc Liam for the vma things, and because he adores PAT stuff ;)

On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> used for duplicating VMAs during fork(), but also for duplicating VMAs
> when splitting VMAs or when mremap()'ing them.

Ugh this sucks, I really want to move a bunch of this stuff out of the fork
code. we have some really nasty overlap there.

This definitely needs to be separate out. Perhaps I can take a look at
that...

>
> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> size) and apparently also shrunk during mremap(), which implies
> duplicating the VMA in __split_vma() first.

Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
(implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
size of the VMA.

(as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
disallowing MREMAP_DONTUNMAP).

Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
can't be faulted in, and thus we can't have unfaulted backing for it, but
we can shrink safely.

>
> In case of ordinary mremap() (no change in size), we first duplicate the
> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> the old VMA: we effectively move the VM_PAT reservation. So the
> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> context.
>

OK so we do:

copy_vma_and_data()
-> copy_vma()
-> vm_area_dup()
-> untrack_pfn_clear(new vma)

And:

copy_vma_and_data()
-> untrack_pfn_clear(old vma)

So we end up with... neither tracked. Fun.

Agreed this is incorrect.

> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> reservation when splitting the VMA. Instead, in memtype_erase() -- called
> during zapping/munmap -- we shrink a reservation in case only the end
> address matches: Assume we split a VMA into A and B, both would share a
> reservation until B is unmapped.

Glorious. I really oppose us making radical changes to splitting logic to
suit this one x86-specific feature.

Seems to me the change should be within PAT...

>
> So when unmapping B, the reservation would be updated to cover only A. When
> unmapping A, we would properly remove the now-shrunk reservation. That
> scenario describes the mremap() shrinking (old_size > new_size), where
> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> is wrong.
>
> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> first? It would be broken because we would never free the reservation.
> Likely, there are ways to trigger such a VMA split outside of mremap().

This must have been a problem that already existed, and this is really
quite plausible in reality, which makes me wonder, again, how much PAT is
actually used in the wild.

I may be mistaken of course (happy to be corrected) and it's used very
heavily and somehow this scenario doesn't occur.

We should definitely add a test for this anyway :) even if 'skip' for now
while we figure out how to fix it (_without egregiously impacting split
code_).

This can't happen in mremap() in any case as we are unmapping which happens
to split then always remove the latter VMA.

OK so to get back into the split logic, this is:

shrink_vma() (in mm/mremap.c)
-> do_vmi_munmap()
-> do_vmi_align_munmap()
-> vms_gather_munmap_vmas()
-> __split_vma()
(later called from do_vmi_align_munmap())
-> vms_complete_munmap_vmas()
-> vms_clear_ptes()
-> unmap_vmas()
-> unmap_single_vma()
-> untrack_pfn()
-> free_pfn_range()
-> memtype_free()
-> memtype_erase()

As simple as that! ;)

Makes sense.

>
> Affecting other VMA duplication was not intended, vm_area_dup() being
> used outside of kernel/fork.c was an oversight. So let's fix that for;
> how to handle VMA splits better should be investigated separately.

To reiterate, I think this should be handled within PAT itself, rather than
result in changes to VMA code, unless it results in us adding sensible
hooks there.

>
> This was found by code inspection only, while staring at yet another
> VM_PAT problem.

My sympathies...

>
> Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Anyway, having gone through your excellent descriptions (albeit, feeling
the same pain as you did :P), and looking at the logic, I agree this patch
is correct, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>
> This VM_PAT code really wants me to scream at my computer. So far it didn't
> succeed, but I am close. Well, at least now I understand how it interacts
> with VMA splitting ...

Well, I'm also quite scared of it, and indeed also relate ;) How heavily
used is PAT? We do seem to constantly run into problems with getting it to
behave itself wrt VMA code.

We recently had to remove some quite egregious hooks in VMA code which was
a pain, is there a better way of doing this?

I really do hate this 'randomly call a function in various spots and do
something specific for feature X' pattern that we use for hugetlb, uffd,
this, and other stuff.

We need to use (carefully constrained!) generic hooks for this kind of
thing.

At least tell me the mremap() refactor I did made this less horrible? ;)

>
> ---
>  kernel/fork.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b8..168681fc4b25a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -498,10 +498,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	vma_numab_state_init(new);
>  	dup_anon_vma_name(orig, new);
>
> -	/* track_pfn_copy() will later take care of copying internal state. */
> -	if (unlikely(new->vm_flags & VM_PFNMAP))
> -		untrack_pfn_clear(new);
> -
>  	return new;
>  }
>
> @@ -672,6 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		tmp = vm_area_dup(mpnt);
>  		if (!tmp)
>  			goto fail_nomem;
> +
> +		/* track_pfn_copy() will later take care of copying internal state. */
> +		if (unlikely(tmp->vm_flags & VM_PFNMAP))
> +			untrack_pfn_clear(tmp);
> +
>  		retval = vma_dup_policy(mpnt, tmp);
>  		if (retval)
>  			goto fail_nomem_policy;
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-22 14:54 ` David Hildenbrand
@ 2025-04-23 14:42   ` Lorenzo Stoakes
  2025-04-23 15:30     ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-04-23 14:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On Tue, Apr 22, 2025 at 04:54:54PM +0200, David Hildenbrand wrote:
> On 22.04.25 16:49, David Hildenbrand wrote:
> > Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> > used for duplicating VMAs during fork(), but also for duplicating VMAs
> > when splitting VMAs or when mremap()'ing them.
> >
> > VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> > size) and apparently also shrunk during mremap(), which implies
> > duplicating the VMA in __split_vma() first.
> >
> > In case of ordinary mremap() (no change in size), we first duplicate the
> > VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> > the old VMA: we effectively move the VM_PAT reservation. So the
> > untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> > context.
> >
> > Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> > reservation when splitting the VMA. Instead, in memtype_erase() -- called
> > during zapping/munmap -- we shrink a reservation in case only the end
> > address matches: Assume we split a VMA into A and B, both would share a
> > reservation until B is unmapped.
> >
> > So when unmapping B, the reservation would be updated to cover only A. When
> > unmapping A, we would properly remove the now-shrunk reservation. That
> > scenario describes the mremap() shrinking (old_size > new_size), where
> > we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> > is wrong.
> >
> > What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> > first? It would be broken because we would never free the reservation.
> > Likely, there are ways to trigger such a VMA split outside of mremap().
>
> As expected ... with a simple reproducer that uses mprotect() to split such
> a VMA I can trigger
>
> x86/PAT: pat_mremap:26448 freeing invalid memtype [mem
> 0x00000000-0x00000fff]

Wow.

Might be worth adding a self test for this if not too difficult, even if as
a skipped one w/comment just so we have it ready to go?

>
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-23 14:42   ` Lorenzo Stoakes
@ 2025-04-23 15:30     ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-04-23 15:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On 23.04.25 16:42, Lorenzo Stoakes wrote:
> On Tue, Apr 22, 2025 at 04:54:54PM +0200, David Hildenbrand wrote:
>> On 22.04.25 16:49, David Hildenbrand wrote:
>>> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
>>> used for duplicating VMAs during fork(), but also for duplicating VMAs
>>> when splitting VMAs or when mremap()'ing them.
>>>
>>> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
>>> size) and apparently also shrunk during mremap(), which implies
>>> duplicating the VMA in __split_vma() first.
>>>
>>> In case of ordinary mremap() (no change in size), we first duplicate the
>>> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
>>> the old VMA: we effectively move the VM_PAT reservation. So the
>>> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
>>> context.
>>>
>>> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
>>> reservation when splitting the VMA. Instead, in memtype_erase() -- called
>>> during zapping/munmap -- we shrink a reservation in case only the end
>>> address matches: Assume we split a VMA into A and B, both would share a
>>> reservation until B is unmapped.
>>>
>>> So when unmapping B, the reservation would be updated to cover only A. When
>>> unmapping A, we would properly remove the now-shrunk reservation. That
>>> scenario describes the mremap() shrinking (old_size > new_size), where
>>> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
>>> is wrong.
>>>
>>> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
>>> first? It would be broken because we would never free the reservation.
>>> Likely, there are ways to trigger such a VMA split outside of mremap().
>>
>> As expected ... with a simple reproducer that uses mprotect() to split such
>> a VMA I can trigger
>>
>> x86/PAT: pat_mremap:26448 freeing invalid memtype [mem
>> 0x00000000-0x00000fff]
> 
> Wow.
> 
> Might be worth adding a self test for this if not too difficult, even if as
> a skipped one w/comment just so we have it ready to go?

Yeah, I think it's as stupid as mmaping two pages worth of /dev/mem, and 
then munmapping only a single one. It's so easy to trigger that it makes 
me absolutely angry.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-23 14:41 ` Lorenzo Stoakes
@ 2025-04-23 19:30   ` David Hildenbrand
  2025-04-24  7:21     ` Lorenzo Stoakes
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-04-23 19:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On 23.04.25 16:41, Lorenzo Stoakes wrote:
> +cc Liam for the vma things, and because he adores PAT stuff ;)
> 
> On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
>> Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
>> used for duplicating VMAs during fork(), but also for duplicating VMAs
>> when splitting VMAs or when mremap()'ing them.
> 
> Ugh this sucks, I really want to move a bunch of this stuff out of the fork
> code. we have some really nasty overlap there.
> 
> This definitely needs to be separate out. Perhaps I can take a look at
> that...

Yes, it should likely live in ... vma.c ? :)

> 
>>
>> VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
>> size) and apparently also shrunk during mremap(), which implies
>> duplicating the VMA in __split_vma() first.
> 
> Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
> (implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
> size of the VMA.
> 
> (as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
> disallowing MREMAP_DONTUNMAP).
> 
> Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
> can't be faulted in, and thus we can't have unfaulted backing for it, but
> we can shrink safely.
> 
>>
>> In case of ordinary mremap() (no change in size), we first duplicate the
>> VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
>> the old VMA: we effectively move the VM_PAT reservation. So the
>> untrack_pfn_clear() call on the new VMA duplicating is wrong in that
>> context.
>>
> 
> OK so we do:
> 
> copy_vma_and_data()
> -> copy_vma()
> -> vm_area_dup()
> -> untrack_pfn_clear(new vma)
> 
> And:
> 
> copy_vma_and_data()
> -> untrack_pfn_clear(old vma)
> 
> So we end up with... neither tracked. Fun.
> 
> Agreed this is incorrect.
> 
>> Splitting of VMAs seems problematic, because we don't duplicate/adjust the
>> reservation when splitting the VMA. Instead, in memtype_erase() -- called
>> during zapping/munmap -- we shrink a reservation in case only the end
>> address matches: Assume we split a VMA into A and B, both would share a
>> reservation until B is unmapped.
> 
> Glorious. I really oppose us making radical changes to splitting logic to
> suit this one x86-specific feature.
> 
> Seems to me the change should be within PAT...

Yeah ...

> 
>>
>> So when unmapping B, the reservation would be updated to cover only A. When
>> unmapping A, we would properly remove the now-shrunk reservation. That
>> scenario describes the mremap() shrinking (old_size > new_size), where
>> we split + unmap B, and the untrack_pfn_clear() on the new VMA when
>> is wrong.
>>
>> What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
>> first? It would be broken because we would never free the reservation.
>> Likely, there are ways to trigger such a VMA split outside of mremap().
> 
> This must have been a problem that already existed, and this is really
> quite plausible in reality, which makes me wonder, again, how much PAT is
> actually used in the wild.

Probably people use it all the time, but we only fix stuff that we hit 
in practice.

Fun. :)

> 
> I may be mistaken of course (happy to be corrected) and it's used very
> heavily and somehow this scenario doesn't occur.
> 
> We should definitely add a test for this anyway :) even if 'skip' for now
> while we figure out how to fix it (_without egregiously impacting split
> code_).
> 
> This can't happen in mremap() in any case as we are unmapping which happens
> to split then always remove the latter VMA.
> 
> OK so to get back into the split logic, this is:
> 
> shrink_vma() (in mm/mremap.c)
> -> do_vmi_munmap()
> -> do_vmi_align_munmap()
> -> vms_gather_munmap_vmas()
> -> __split_vma()
> (later called from do_vmi_align_munmap())
> -> vms_complete_munmap_vmas()
> -> vms_clear_ptes()
> -> unmap_vmas()
> -> unmap_single_vma()
> -> untrack_pfn()
> -> free_pfn_range()
> -> memtype_free()
> -> memtype_erase()
> 
> As simple as that! ;)

My head started spinning when I followed that callchain :D

> 
> Makes sense.
> 
>>
>> Affecting other VMA duplication was not intended, vm_area_dup() being
>> used outside of kernel/fork.c was an oversight. So let's fix that for;
>> how to handle VMA splits better should be investigated separately.
> 
> To reiterate, I think this should be handled within PAT itself, rather than
> result in changes to VMA code, unless it results in us adding sensible
> hooks there.

Yes. I wish we could remove it; I'm afraid we can;t

[...]

> Anyway, having gone through your excellent descriptions (albeit, feeling
> the same pain as you did :P), and looking at the logic, I agree this patch
> is correct, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>
>> This VM_PAT code really wants me to scream at my computer. So far it didn't
>> succeed, but I am close. Well, at least now I understand how it interacts
>> with VMA splitting ...
> 
> Well, I'm also quite scared of it, and indeed also relate ;) How heavily
> used is PAT? We do seem to constantly run into problems with getting it to
> behave itself wrt VMA code.
> 
> We recently had to remove some quite egregious hooks in VMA code which was
> a pain, is there a better way of doing this?
> 
> I really do hate this 'randomly call a function in various spots and do
> something specific for feature X' pattern that we use for hugetlb, uffd,
> this, and other stuff.

I hate it.

Probably the right way of attaching such metadata to a VMA would be 
remembering it alongside the VMA in a very simple way.

For example, when we perform a reservation we would allocate a 
refcounted object and assign it to the VMA (pointer, xarray, whatever).

Duplicating the VMA would increase the refcount. Freeing a VMA would 
decrease the refcount.

Once the refcount goes to zero, we undo the reservation and free the object.

We would not adjust a reservation on partial VMA unmap (split + unmap A 
or B), but I strongly assume that would just be fine as long as we undo 
the reservation once the refcount goes to 0.

There is one complication: this nasty memremap.c that abuses it for !vma 
handling. But that could be split out and handled differently.

> 
> We need to use (carefully constrained!) generic hooks for this kind of
> thing.
> 
> At least tell me the mremap() refactor I did made this less horrible? ;)


haha, absolutely :P


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-23 19:30   ` David Hildenbrand
@ 2025-04-24  7:21     ` Lorenzo Stoakes
  2025-04-24  8:45       ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24  7:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On Wed, Apr 23, 2025 at 09:30:09PM +0200, David Hildenbrand wrote:
> On 23.04.25 16:41, Lorenzo Stoakes wrote:
> > +cc Liam for the vma things, and because he adores PAT stuff ;)
> >
> > On Tue, Apr 22, 2025 at 04:49:42PM +0200, David Hildenbrand wrote:
> > > Not intuitive, but vm_area_dup() located in kernel/fork.c is not only
> > > used for duplicating VMAs during fork(), but also for duplicating VMAs
> > > when splitting VMAs or when mremap()'ing them.
> >
> > Ugh this sucks, I really want to move a bunch of this stuff out of the fork
> > code. we have some really nasty overlap there.
> >
> > This definitely needs to be separate out. Perhaps I can take a look at
> > that...
>
> Yes, it should likely live in ... vma.c ? :)

Yes indeed this would make most sense except if fork needs it vma.c/vma.h
is (very intentionally) mm-internal so sadly can't live there in this case
:(

Anyway on my todo will find a place for it in mm if possible... :)

>
> >
> > >
> > > VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
> > > size) and apparently also shrunk during mremap(), which implies
> > > duplicating the VMA in __split_vma() first.
> >
> > Yes, it appears we only disallow VM_PFNMAP on a remap if it is MREMAP_FIXED
> > (implies MREMAP_MAYMOVE) to a new specific address _and_ we _increase_ the
> > size of the VMA.
> >
> > (as determined by vrm_implies_new_addr(), with resize_is_valid() explicitly
> > disallowing MREMAP_DONTUNMAP).
> >
> > Makes sense as VM_PFNMAP implies we map non-vm_normal_folio() stuff, which
> > can't be faulted in, and thus we can't have unfaulted backing for it, but
> > we can shrink safely.
> >
> > >
> > > In case of ordinary mremap() (no change in size), we first duplicate the
> > > VMA in copy_vma_and_data()->copy_vma() to then call untrack_pfn_clear() on
> > > the old VMA: we effectively move the VM_PAT reservation. So the
> > > untrack_pfn_clear() call on the new VMA duplicating is wrong in that
> > > context.
> > >
> >
> > OK so we do:
> >
> > copy_vma_and_data()
> > -> copy_vma()
> > -> vm_area_dup()
> > -> untrack_pfn_clear(new vma)
> >
> > And:
> >
> > copy_vma_and_data()
> > -> untrack_pfn_clear(old vma)
> >
> > So we end up with... neither tracked. Fun.
> >
> > Agreed this is incorrect.
> >
> > > Splitting of VMAs seems problematic, because we don't duplicate/adjust the
> > > reservation when splitting the VMA. Instead, in memtype_erase() -- called
> > > during zapping/munmap -- we shrink a reservation in case only the end
> > > address matches: Assume we split a VMA into A and B, both would share a
> > > reservation until B is unmapped.
> >
> > Glorious. I really oppose us making radical changes to splitting logic to
> > suit this one x86-specific feature.
> >
> > Seems to me the change should be within PAT...
>
> Yeah ...
>
> >
> > >
> > > So when unmapping B, the reservation would be updated to cover only A. When
> > > unmapping A, we would properly remove the now-shrunk reservation. That
> > > scenario describes the mremap() shrinking (old_size > new_size), where
> > > we split + unmap B, and the untrack_pfn_clear() on the new VMA when
> > > is wrong.
> > >
> > > What if we manage to split a VM_PFNMAP VMA into A and B and unmap A
> > > first? It would be broken because we would never free the reservation.
> > > Likely, there are ways to trigger such a VMA split outside of mremap().
> >
> > This must have been a problem that already existed, and this is really
> > quite plausible in reality, which makes me wonder, again, how much PAT is
> > actually used in the wild.
>
> Probably people use it all the time, but we only fix stuff that we hit in
> practice.
>
> Fun. :)
>
> >
> > I may be mistaken of course (happy to be corrected) and it's used very
> > heavily and somehow this scenario doesn't occur.
> >
> > We should definitely add a test for this anyway :) even if 'skip' for now
> > while we figure out how to fix it (_without egregiously impacting split
> > code_).
> >
> > This can't happen in mremap() in any case as we are unmapping which happens
> > to split then always remove the latter VMA.
> >
> > OK so to get back into the split logic, this is:
> >
> > shrink_vma() (in mm/mremap.c)
> > -> do_vmi_munmap()
> > -> do_vmi_align_munmap()
> > -> vms_gather_munmap_vmas()
> > -> __split_vma()
> > (later called from do_vmi_align_munmap())
> > -> vms_complete_munmap_vmas()
> > -> vms_clear_ptes()
> > -> unmap_vmas()
> > -> unmap_single_vma()
> > -> untrack_pfn()
> > -> free_pfn_range()
> > -> memtype_free()
> > -> memtype_erase()
> >
> > As simple as that! ;)
>
> My head started spinning when I followed that callchain :D
>
> >
> > Makes sense.
> >
> > >
> > > Affecting other VMA duplication was not intended, vm_area_dup() being
> > > used outside of kernel/fork.c was an oversight. So let's fix that for;
> > > how to handle VMA splits better should be investigated separately.
> >
> > To reiterate, I think this should be handled within PAT itself, rather than
> > result in changes to VMA code, unless it results in us adding sensible
> > hooks there.
>
> Yes. I wish we could remove it; I'm afraid we can;t

Sigh one of these... I guess I was being hopeful!

>
> [...]
>
> > Anyway, having gone through your excellent descriptions (albeit, feeling
> > the same pain as you did :P), and looking at the logic, I agree this patch
> > is correct, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!

No probs!

>
> >
> > > ---
> > >
> > > This VM_PAT code really wants me to scream at my computer. So far it didn't
> > > succeed, but I am close. Well, at least now I understand how it interacts
> > > with VMA splitting ...
> >
> > Well, I'm also quite scared of it, and indeed also relate ;) How heavily
> > used is PAT? We do seem to constantly run into problems with getting it to
> > behave itself wrt VMA code.
> >
> > We recently had to remove some quite egregious hooks in VMA code which was
> > a pain, is there a better way of doing this?
> >
> > I really do hate this 'randomly call a function in various spots and do
> > something specific for feature X' pattern that we use for hugetlb, uffd,
> > this, and other stuff.
>
> I hate it.
>
> Probably the right way of attaching such metadata to a VMA would be
> remembering it alongside the VMA in a very simple way.
>
> For example, when we perform a reservation we would allocate a refcounted
> object and assign it to the VMA (pointer, xarray, whatever).
>
> Duplicating the VMA would increase the refcount. Freeing a VMA would
> decrease the refcount.
>
> Once the refcount goes to zero, we undo the reservation and free the object.
>
> We would not adjust a reservation on partial VMA unmap (split + unmap A or
> B), but I strongly assume that would just be fine as long as we undo the
> reservation once the refcount goes to 0.

Yeah this is a really good idea actually, almost kinda what refcounts are
for haha...

The problem is we talk about this idly here, but neither of us wants to
actually write PAT code I'd say, so this may go nowhere. But maybe one of
us will get so frustrated that we do this anyway but still...

Then again - actually, is this something you are planning to tackle?

It's curious it's not come up before, maybe not usual for this to happen
with VM_PFNMAP PAT mappings.

>
> There is one complication: this nasty memremap.c that abuses it for !vma
> handling. But that could be split out and handled differently.

I've not really looked into this but pre-yuck...

>
> >
> > We need to use (carefully constrained!) generic hooks for this kind of
> > thing.
> >
> > At least tell me the mremap() refactor I did made this less horrible? ;)
>
>
> haha, absolutely :P

Good :)

>
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-24  7:21     ` Lorenzo Stoakes
@ 2025-04-24  8:45       ` David Hildenbrand
  2025-04-24  9:49         ` Lorenzo Stoakes
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-04-24  8:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

>> Probably the right way of attaching such metadata to a VMA would be
>> remembering it alongside the VMA in a very simple way.
>>
>> For example, when we perform a reservation we would allocate a refcounted
>> object and assign it to the VMA (pointer, xarray, whatever).
>>
>> Duplicating the VMA would increase the refcount. Freeing a VMA would
>> decrease the refcount.
>>
>> Once the refcount goes to zero, we undo the reservation and free the object.
>>
>> We would not adjust a reservation on partial VMA unmap (split + unmap A or
>> B), but I strongly assume that would just be fine as long as we undo the
>> reservation once the refcount goes to 0.
> 
> Yeah this is a really good idea actually, almost kinda what refcounts are
> for haha...
> 
> The problem is we talk about this idly here, but neither of us wants to
> actually write PAT code I'd say, so this may go nowhere. But maybe one of
> us will get so frustrated that we do this anyway but still...
> 
> Then again - actually, is this something you are planning to tackle?

I hate this much with that much passion that I'll give it a try for a 
couple of hours, as it might fix the other issues we are seeing.  So far 
it looks like it cleans up stuff *beautifully*. Even VM_PAT can go ... :)

... and I think we still have space in vm_area_struct without increasing 
it beyond 192 bytes.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-24  8:45       ` David Hildenbrand
@ 2025-04-24  9:49         ` Lorenzo Stoakes
  2025-04-24 12:33           ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24  9:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On Thu, Apr 24, 2025 at 10:45:36AM +0200, David Hildenbrand wrote:
> > > Probably the right way of attaching such metadata to a VMA would be
> > > remembering it alongside the VMA in a very simple way.
> > >
> > > For example, when we perform a reservation we would allocate a refcounted
> > > object and assign it to the VMA (pointer, xarray, whatever).
> > >
> > > Duplicating the VMA would increase the refcount. Freeing a VMA would
> > > decrease the refcount.
> > >
> > > Once the refcount goes to zero, we undo the reservation and free the object.
> > >
> > > We would not adjust a reservation on partial VMA unmap (split + unmap A or
> > > B), but I strongly assume that would just be fine as long as we undo the
> > > reservation once the refcount goes to 0.
> >
> > Yeah this is a really good idea actually, almost kinda what refcounts are
> > for haha...
> >
> > The problem is we talk about this idly here, but neither of us wants to
> > actually write PAT code I'd say, so this may go nowhere. But maybe one of
> > us will get so frustrated that we do this anyway but still...
> >
> > Then again - actually, is this something you are planning to tackle?
>
> I hate this much with that much passion that I'll give it a try for a couple
> of hours, as it might fix the other issues we are seeing.  So far it looks
> like it cleans up stuff *beautifully*. Even VM_PAT can go ... :)

To quote a film, let the hate flow through you :P

I mean I am very familiar with this - vma merge, anon_vma, etc. there's a
pattern... ;)

>
> ... and I think we still have space in vm_area_struct without increasing it
> beyond 192 bytes.

Hm, so you're thinking of a general field in the VMA? I thought this would
belong to the PAT object somehow?

Though getting rid of VM_PAT would be fantastic...

I wonder if a _general_ VMA ref count would be a bit much just for this
case.

But maybe I misunderstand your approach :) Happy to obviously look and if
not like some crazy thing just for PAT (you can understand why I would not
like this) will be supportive :>)


>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-24  9:49         ` Lorenzo Stoakes
@ 2025-04-24 12:33           ` David Hildenbrand
  2025-04-24 12:49             ` Lorenzo Stoakes
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-04-24 12:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

>>
>> ... and I think we still have space in vm_area_struct without increasing it
>> beyond 192 bytes.
> 
> Hm, so you're thinking of a general field in the VMA? I thought this would
> belong to the PAT object somehow?

It's glued to a VMA. The only alternative to using a VMA field would be looking it
up for a VMA, storing it in an xarray etc ... ends up complicating stuff when there
is no need to right now.

> 
> Though getting rid of VM_PAT would be fantastic...
> 
> I wonder if a _general_ VMA ref count would be a bit much just for this
> case.

I don't think it would be helpful for this case. It's much more similar to the anon
VMA name (that also has its own kref)

> 
> But maybe I misunderstand your approach :) Happy to obviously look and if
> not like some crazy thing just for PAT (you can understand why I would not
> like this) will be supportive :>)

This is something quick (well, longer than I wish it would take) that seems to
work. There are smaller pat-internal cleanups to be had on top of this, and
the new functions shall be documented.


Observe how:
* We remove VM_PAT and that weird VM flags manipulation + "locked" flag
* We remove any traces of the nasty tracking handling from mremap+fork code
* Just like anon_vma_name, it hooks into vm_area_dup()/vm_area_free().
* We remove the page table lookup via get_pat_info()->... completely
* We remove the VMA parameter from PAT code completely
* We reduce the track/untrack/sanitize interface to 3 functions

 From 4cf8b2a2e60220c5b438adf920d75cba3af50ab4 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 24 Apr 2025 12:06:15 +0200
Subject: [PATCH] mm: rewrite pfnmap tracking

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  arch/x86/mm/pat/memtype.c      | 155 ++-------------------------------
  drivers/gpu/drm/i915/i915_mm.c |   4 +-
  include/linux/mm.h             |   4 +-
  include/linux/mm_inline.h      |   2 +
  include/linux/mm_types.h       |  11 +++
  include/linux/pgtable.h        |  71 ++-------------
  include/trace/events/mmflags.h |   4 +-
  kernel/fork.c                  |  54 ++++++++++--
  mm/huge_memory.c               |   7 +-
  mm/io-mapping.c                |   2 +-
  mm/memory.c                    |  93 +++++++++++++++-----
  mm/memremap.c                  |   8 +-
  mm/mremap.c                    |   4 -
  13 files changed, 162 insertions(+), 257 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 72d8cbc611583..237c7e5e9d9aa 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -932,124 +932,14 @@ static void free_pfn_range(u64 paddr, unsigned long size)
  		memtype_free(paddr, paddr + size);
  }
  
-static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
-		resource_size_t *phys)
-{
-	struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
-
-	if (follow_pfnmap_start(&args))
-		return -EINVAL;
-
-	/* Never return PFNs of anon folios in COW mappings. */
-	if (!args.special) {
-		follow_pfnmap_end(&args);
-		return -EINVAL;
-	}
-
-	*prot = pgprot_val(args.pgprot);
-	*phys = (resource_size_t)args.pfn << PAGE_SHIFT;
-	follow_pfnmap_end(&args);
-	return 0;
-}
-
-static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
-		pgprot_t *pgprot)
-{
-	unsigned long prot;
-
-	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
-
-	/*
-	 * We need the starting PFN and cachemode used for track_pfn_remap()
-	 * that covered the whole VMA. For most mappings, we can obtain that
-	 * information from the page tables. For COW mappings, we might now
-	 * suddenly have anon folios mapped and follow_phys() will fail.
-	 *
-	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
-	 * detect the PFN. If we need the cachemode as well, we're out of luck
-	 * for now and have to fail fork().
-	 */
-	if (!follow_phys(vma, &prot, paddr)) {
-		if (pgprot)
-			*pgprot = __pgprot(prot);
-		return 0;
-	}
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (pgprot)
-			return -EINVAL;
-		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
-		return 0;
-	}
-	WARN_ON_ONCE(1);
-	return -EINVAL;
-}
-
-int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn)
-{
-	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
-	resource_size_t paddr;
-	pgprot_t pgprot;
-	int rc;
-
-	if (!(src_vma->vm_flags & VM_PAT))
-		return 0;
-
-	/*
-	 * Duplicate the PAT information for the dst VMA based on the src
-	 * VMA.
-	 */
-	if (get_pat_info(src_vma, &paddr, &pgprot))
-		return -EINVAL;
-	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
-	if (rc)
-		return rc;
-
-	/* Reservation for the destination VMA succeeded. */
-	vm_flags_set(dst_vma, VM_PAT);
-	*pfn = PHYS_PFN(paddr);
-	return 0;
-}
-
-void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
-{
-	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
-	/*
-	 * Reservation was freed, any copied page tables will get cleaned
-	 * up later, but without getting PAT involved again.
-	 */
-}
-
-/*
- * prot is passed in as a parameter for the new mapping. If the vma has
- * a linear pfn mapping for the entire range, or no vma is provided,
- * reserve the entire pfn + size range with single reserve_pfn_range
- * call.
- */
-int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-		    unsigned long pfn, unsigned long addr, unsigned long size)
+int pfnmap_sanitize(unsigned long pfn, unsigned long size, pgprot_t *prot)
  {
  	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
  	enum page_cache_mode pcm;
  
-	/* reserve the whole chunk starting from paddr */
-	if (!vma || (addr == vma->vm_start
-				&& size == (vma->vm_end - vma->vm_start))) {
-		int ret;
-
-		ret = reserve_pfn_range(paddr, size, prot, 0);
-		if (ret == 0 && vma)
-			vm_flags_set(vma, VM_PAT);
-		return ret;
-	}
-
  	if (!pat_enabled())
  		return 0;
  
-	/*
-	 * For anything smaller than the vma size we set prot based on the
-	 * lookup.
-	 */
  	pcm = lookup_memtype(paddr);
  
  	/* Check memtype for the remaining pages */
@@ -1066,51 +956,18 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  	return 0;
  }
  
-void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
  {
-	enum page_cache_mode pcm;
-
-	if (!pat_enabled())
-		return;
+	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
  
-	/* Set prot based on lookup */
-	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
-			 cachemode2protval(pcm));
+	return reserve_pfn_range(paddr, size, prot, 0);
  }
  
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-		 unsigned long size, bool mm_wr_locked)
+void pfnmap_untrack(unsigned long pfn, unsigned long size)
  {
-	resource_size_t paddr;
+	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
  
-	if (vma && !(vma->vm_flags & VM_PAT))
-		return;
-
-	/* free the chunk starting from pfn or the whole chunk */
-	paddr = (resource_size_t)pfn << PAGE_SHIFT;
-	if (!paddr && !size) {
-		if (get_pat_info(vma, &paddr, NULL))
-			return;
-		size = vma->vm_end - vma->vm_start;
-	}
  	free_pfn_range(paddr, size);
-	if (vma) {
-		if (mm_wr_locked)
-			vm_flags_clear(vma, VM_PAT);
-		else
-			__vm_flags_mod(vma, 0, VM_PAT);
-	}
-}
-
-void untrack_pfn_clear(struct vm_area_struct *vma)
-{
-	vm_flags_clear(vma, VM_PAT);
  }
  
  pgprot_t pgprot_writecombine(pgprot_t prot)
diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
index 76e2801619f09..c33bd3d830699 100644
--- a/drivers/gpu/drm/i915/i915_mm.c
+++ b/drivers/gpu/drm/i915/i915_mm.c
@@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
  
  	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
  
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
  	r.mm = vma->vm_mm;
  	r.pfn = pfn;
  	r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
@@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma,
  	};
  	int err;
  
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
  	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
  
  	while (offset >= r.sgt.max >> PAGE_SHIFT) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c46..1dc7df6ff38e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -356,9 +356,7 @@ extern unsigned int kobjsize(const void *objp);
  # define VM_SHADOW_STACK	VM_NONE
  #endif
  
-#if defined(CONFIG_X86)
-# define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
  # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
  #elif defined(CONFIG_PARISC)
  # define VM_GROWSUP	VM_ARCH_1
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5c..89b518ff097e6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
  
  #endif  /* CONFIG_ANON_VMA_NAME */
  
+void pfnmap_track_ctx_release(struct kref *ref);
+
  static inline void init_tlb_flush_pending(struct mm_struct *mm)
  {
  	atomic_set(&mm->tlb_flush_pending, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f91..91124761cfda8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -764,6 +764,14 @@ struct vma_numab_state {
  	int prev_scan_seq;
  };
  
+#ifdef __HAVE_PFNMAP_TRACKING
+struct pfnmap_track_ctx {
+	struct kref kref;
+	unsigned long pfn;
+	unsigned long size;
+};
+#endif
+
  /*
   * This struct describes a virtual memory area. There is one of these
   * per VM-area/task. A VM area is any part of the process virtual memory
@@ -877,6 +885,9 @@ struct vm_area_struct {
  	struct anon_vma_name *anon_name;
  #endif
  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef __HAVE_PFNMAP_TRACKING
+	struct pfnmap_track_ctx *pfnmap_track_ctx;
+#endif
  } __randomize_layout;
  
  #ifdef CONFIG_NUMA
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c921..941ef982e1b61 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1489,82 +1489,25 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
   * vmf_insert_pfn.
   */
  
-/*
- * track_pfn_remap is called when a _new_ pfn mapping is being established
- * by remap_pfn_range() for physical range indicated by pfn and size.
- */
-static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-				  unsigned long pfn, unsigned long addr,
-				  unsigned long size)
+/* Cannot fail if size <= PAGE_SIZE. */
+int pfnmap_sanitize(unsigned long pfn, unsigned long size, pgprot_t *prot)
  {
  	return 0;
  }
  
-/*
- * track_pfn_insert is called when a _new_ single pfn is established
- * by vmf_insert_pfn().
- */
-static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-				    pfn_t pfn)
-{
-}
-
-/*
- * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
- * tables copied during copy_page_range(). Will store the pfn to be
- * passed to untrack_pfn_copy() only if there is something to be untracked.
- * Callers should initialize the pfn to 0.
- */
-static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn)
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
  {
  	return 0;
  }
  
-/*
- * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
- * copy_page_range(), but after track_pfn_copy() was already called. Can
- * be called even if track_pfn_copy() did not actually track anything:
- * handled internally.
- */
-static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		unsigned long pfn)
+void pfnmap_untrack(unsigned long pfn, unsigned long size)
  {
  }
  
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-static inline void untrack_pfn(struct vm_area_struct *vma,
-			       unsigned long pfn, unsigned long size,
-			       bool mm_wr_locked)
-{
-}
-
-/*
- * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
- *
- * 1) During mremap() on the src VMA after the page tables were moved.
- * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
- */
-static inline void untrack_pfn_clear(struct vm_area_struct *vma)
-{
-}
  #else
-extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-			   unsigned long pfn, unsigned long addr,
-			   unsigned long size);
-extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-			     pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn);
-extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		unsigned long pfn);
-extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-			unsigned long size, bool mm_wr_locked);
-extern void untrack_pfn_clear(struct vm_area_struct *vma);
+int pfnmap_sanitize(unsigned long pfn, unsigned long size, pgprot_t *prot);
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
+void pfnmap_untrack(unsigned long pfn, unsigned long size);
  #endif
  
  #ifdef CONFIG_MMU
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 15aae955a10bf..aa441f593e9a6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
  	__def_pageflag_names						\
  	) : "none"
  
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_1 {VM_PAT,     "pat"           }
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
  #define __VM_ARCH_SPECIFIC_1 {VM_SAO,     "sao"           }
  #elif defined(CONFIG_PARISC)
  #define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP,	"growsup"	}
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8..a6c54dde5f05c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,7 +481,50 @@ static void vm_area_init_from(const struct vm_area_struct *src,
  #ifdef CONFIG_NUMA
  	dest->vm_policy = src->vm_policy;
  #endif
+#ifdef __HAVE_PFNMAP_TRACKING
+	dest->pfnmap_track_ctx = NULL;
+#endif
+}
+
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return 0;
+
+	/*
+	 * We don't expect to ever hit this. If ever required, we would have
+	 * to duplicate the tracking.
+	 */
+	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
+		return -ENOMEM;
+	kref_get(&ctx->kref);
+	new->pfnmap_track_ctx = ctx;
+	return 0;
+}
+
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return;
+
+	kref_put(&ctx->kref, pfnmap_track_ctx_release);
+	vma->pfnmap_track_ctx = NULL;
+}
+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+}
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma);
+{
  }
+#endif
  
  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
  {
@@ -493,15 +536,15 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
  	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
  	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
  	vm_area_init_from(orig, new);
+
+	if (vma_pfnmap_track_ctx_dup(orig, new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
+	}
  	vma_lock_init(new, true);
  	INIT_LIST_HEAD(&new->anon_vma_chain);
  	vma_numab_state_init(new);
  	dup_anon_vma_name(orig, new);
-
-	/* track_pfn_copy() will later take care of copying internal state. */
-	if (unlikely(new->vm_flags & VM_PFNMAP))
-		untrack_pfn_clear(new);
-
  	return new;
  }
  
@@ -511,6 +554,7 @@ void vm_area_free(struct vm_area_struct *vma)
  	vma_assert_detached(vma);
  	vma_numab_state_free(vma);
  	free_anon_vma_name(vma);
+	vma_pfnmap_track_ctx_release(vma);
  	kmem_cache_free(vm_area_cachep, vma);
  }
  
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab77..9ad6a0a8f0089 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,7 +1456,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
  			return VM_FAULT_OOM;
  	}
  
-	track_pfn_insert(vma, &pgprot, pfn);
+	/* TODO: we should check the whole range and handle errors. */
+	pfnmap_sanitize(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
+
  	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
  	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
  			pgtable);
@@ -1578,7 +1580,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
  	if (addr < vma->vm_start || addr >= vma->vm_end)
  		return VM_FAULT_SIGBUS;
  
-	track_pfn_insert(vma, &pgprot, pfn);
+	/* TODO: we should check the whole range and handle errors. */
+	pfnmap_sanitize(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
  
  	ptl = pud_lock(vma->vm_mm, vmf->pud);
  	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
diff --git a/mm/io-mapping.c b/mm/io-mapping.c
index 01b3627999304..7266441ad0834 100644
--- a/mm/io-mapping.c
+++ b/mm/io-mapping.c
@@ -21,7 +21,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
  	if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags))
  		return -EINVAL;
  
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
  	return remap_pfn_range_notrack(vma, addr, pfn, size,
  		__pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
  			 (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)));
diff --git a/mm/memory.c b/mm/memory.c
index ba3ea0a82f7f7..fdbba7261af4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1361,7 +1361,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  	struct mm_struct *dst_mm = dst_vma->vm_mm;
  	struct mm_struct *src_mm = src_vma->vm_mm;
  	struct mmu_notifier_range range;
-	unsigned long next, pfn = 0;
+	unsigned long next;
  	bool is_cow;
  	int ret;
  
@@ -1371,12 +1371,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  	if (is_vm_hugetlb_page(src_vma))
  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
  
-	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
-		if (ret)
-			return ret;
-	}
-
  	/*
  	 * We need to invalidate the secondary MMU mappings only when
  	 * there could be a permission downgrade on the ptes of the
@@ -1418,8 +1412,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  		raw_write_seqcount_end(&src_mm->write_protect_seq);
  		mmu_notifier_invalidate_range_end(&range);
  	}
-	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_copy(dst_vma, pfn);
  	return ret;
  }
  
@@ -1914,9 +1906,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  	if (vma->vm_file)
  		uprobe_munmap(vma, start, end);
  
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0, mm_wr_locked);
-
  	if (start != end) {
  		if (unlikely(is_vm_hugetlb_page(vma))) {
  			/*
@@ -2525,7 +2514,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
  	if (!pfn_modify_allowed(pfn, pgprot))
  		return VM_FAULT_SIGBUS;
  
-	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
+	pfnmap_sanitize(pfn, PAGE_SIZE, &pgprot);
  
  	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
  			false);
@@ -2588,7 +2577,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
  	if (addr < vma->vm_start || addr >= vma->vm_end)
  		return VM_FAULT_SIGBUS;
  
-	track_pfn_insert(vma, &pgprot, pfn);
+	pfnmap_sanitize(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
  
  	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
  		return VM_FAULT_SIGBUS;
@@ -2833,6 +2822,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
  	return error;
  }
  
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
+		unsigned long size, pgprot_t *prot)
+{
+	struct pfnmap_track_ctx *ctx;
+
+	if (pfnmap_track(pfn, size, prot))
+		return ERR_PTR(-EINVAL);
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (unlikely(!ctx)) {
+		pfnmap_untrack(pfn, size);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pfn = pfn;
+	ctx->size = size;
+	kref_init(&ctx->kref);
+	return ctx;
+}
+
+void pfnmap_track_ctx_release(struct kref *ref)
+{
+	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
+
+	pfnmap_untrack(ctx->pfn, ctx->size);
+	kfree(ctx);
+}
+#endif /* __HAVE_PFNMAP_TRACKING */
+
  /**
   * remap_pfn_range - remap kernel memory to userspace
   * @vma: user vma to map to
@@ -2845,20 +2864,54 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
   *
   * Return: %0 on success, negative error code otherwise.
   */
+#ifdef __HAVE_PFNMAP_TRACKING
  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
  		    unsigned long pfn, unsigned long size, pgprot_t prot)
  {
+	struct pfnmap_track_ctx *ctx = NULL;
  	int err;
  
-	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
-	if (err)
-		return -EINVAL;
+	size = PAGE_ALIGN(size);
+
+	/*
+	 * If we cover the full VMA, we'll perform actual tracking, and
+	 * remember to untrack when the last reference to our tracking
+	 * context from a VMA goes away.
+	 *
+	 * If we only cover parts of the VMA, we'll only lookup the prot
+	 * we can use without tracking.
+	 */
+	if (addr == vma->vm_start && addr + size == vma->vm_end) {
+		if (vma->pfnmap_track_ctx)
+			return -EINVAL;
+		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else {
+		err = pfnmap_sanitize(pfn, size, &prot);
+		if (err)
+			return -EINVAL;
+	}
  
  	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
-	if (err)
-		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
-	return err;
+	if (err) {
+		if (ctx)
+			kref_put(&ctx->kref, pfnmap_track_ctx_release);
+		return err;
+	}
+
+	if (ctx)
+		vma->pfnmap_track_ctx = ctx;
+	return 0;
+}
+
+#else
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
  }
+#endif
  EXPORT_SYMBOL(remap_pfn_range);
  
  /**
diff --git a/mm/memremap.c b/mm/memremap.c
index 2aebc1b192da9..c417c843e9b1f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
  	}
  	mem_hotplug_done();
  
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
  	pgmap_array_delete(range);
  }
  
@@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
  	if (nid < 0)
  		nid = numa_mem_id();
  
-	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
-			range_len(range));
+	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
+			     &params->pgprot);
  	if (error)
  		goto err_pfn_remap;
  
@@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
  	if (!is_private)
  		kasan_remove_zero_shadow(__va(range->start), range_len(range));
  err_kasan:
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
  err_pfn_remap:
  	pgmap_array_delete(range);
  	return error;
diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84f..6e78e02f74bd3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
  	if (is_vm_hugetlb_page(vma))
  		clear_vma_resv_huge_pages(vma);
  
-	/* Tell pfnmap has moved from this vma */
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_clear(vma);
-
  	*new_vma_ptr = new_vma;
  	return err;
  }
-- 
2.49.0


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-24 12:33           ` David Hildenbrand
@ 2025-04-24 12:49             ` Lorenzo Stoakes
  2025-04-24 12:52               ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-04-24 12:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

On Thu, Apr 24, 2025 at 02:33:54PM +0200, David Hildenbrand wrote:
> > >
> > > ... and I think we still have space in vm_area_struct without increasing it
> > > beyond 192 bytes.
> >
> > Hm, so you're thinking of a general field in the VMA? I thought this would
> > belong to the PAT object somehow?
>
> It's glued to a VMA. The only alternative to using a VMA field would be looking it
> up for a VMA, storing it in an xarray etc ... ends up complicating stuff when there
> is no need to right now.

Yeah no, that'd be crazy.

I guess we are stuck with this so need to think of a sensible way forward.

>
> >
> > Though getting rid of VM_PAT would be fantastic...
> >
> > I wonder if a _general_ VMA ref count would be a bit much just for this
> > case.
>
> I don't think it would be helpful for this case. It's much more similar to the anon
> VMA name (that also has its own kref)

Ahh I see, adding a field like this, that makes a lot more sense, obviously
could put behind appropriate CONFIG_ flag.

Given cacheline alignment probably have a bit of wiggle room there to avoid
increasing VMA size also.

>
> >
> > But maybe I misunderstand your approach :) Happy to obviously look and if
> > not like some crazy thing just for PAT (you can understand why I would not
> > like this) will be supportive :>)
>
> This is something quick (well, longer than I wish it would take) that seems to
> work. There are smaller pat-internal cleanups to be had on top of this, and
> the new functions shall be documented.
>
>
> Observe how:
> * We remove VM_PAT and that weird VM flags manipulation + "locked" flag
> * We remove any traces of the nasty tracking handling from mremap+fork code
> * Just like anon_vma_name, it hooks into vm_area_dup()/vm_area_free().
> * We remove the page table lookup via get_pat_info()->... completely
> * We remove the VMA parameter from PAT code completely
> * We reduce the track/untrack/sanitize interface to 3 functions

Yeah this is all lovely!

OK this should hopefully be workable then!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
  2025-04-24 12:49             ` Lorenzo Stoakes
@ 2025-04-24 12:52               ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-04-24 12:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, Andrew Morton, Ingo Molnar,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, Rik van Riel, H. Peter Anvin, Linus Torvalds

>>> But maybe I misunderstand your approach :) Happy to obviously look and if
>>> not like some crazy thing just for PAT (you can understand why I would not
>>> like this) will be supportive :>)
>>
>> This is something quick (well, longer than I wish it would take) that seems to
>> work. There are smaller pat-internal cleanups to be had on top of this, and
>> the new functions shall be documented.
>>
>>
>> Observe how:
>> * We remove VM_PAT and that weird VM flags manipulation + "locked" flag
>> * We remove any traces of the nasty tracking handling from mremap+fork code
>> * Just like anon_vma_name, it hooks into vm_area_dup()/vm_area_free().
>> * We remove the page table lookup via get_pat_info()->... completely
>> * We remove the VMA parameter from PAT code completely
>> * We reduce the track/untrack/sanitize interface to 3 functions
> 
> Yeah this is all lovely!
> 
> OK this should hopefully be workable then!

Okay, let me polish that up (and see if there is any reasonable way to 
split it up), and write some doc+descriptions .. and do some more testing.

The VMA split reproducer is definitely happy with this already.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-04-24 12:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-22 14:49 [PATCH v1] kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() David Hildenbrand
2025-04-22 14:54 ` David Hildenbrand
2025-04-23 14:42   ` Lorenzo Stoakes
2025-04-23 15:30     ` David Hildenbrand
2025-04-23 14:41 ` Lorenzo Stoakes
2025-04-23 19:30   ` David Hildenbrand
2025-04-24  7:21     ` Lorenzo Stoakes
2025-04-24  8:45       ` David Hildenbrand
2025-04-24  9:49         ` Lorenzo Stoakes
2025-04-24 12:33           ` David Hildenbrand
2025-04-24 12:49             ` Lorenzo Stoakes
2025-04-24 12:52               ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox