linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	xingwei lee <xrivendell7@gmail.com>,
	yuxin wang <wang1315768607@163.com>,
	Marius Fleischer <fleischermarius@gmail.com>,
	Ingo Molnar <mingo@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dan Carpenter <dan.carpenter@linaro.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@surriel.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Peter Xu <peterx@redhat.com>,
	Liam Howlett <liam.howlett@oracle.com>
Subject: Re: [PATCH v3] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
Date: Wed, 2 Apr 2025 14:20:24 +0200	[thread overview]
Message-ID: <f1544453-7bab-4e65-a6f9-d93aaedb8314@redhat.com> (raw)
In-Reply-To: <abd1c60c-e177-4468-b097-f637bda6ff3c@lucifer.local>

>>
>> v2 -> v3:
>> * Make some !MMU configs happy by just moving the code into memtype.c
> 
> Obviously we need to make the bots happy once again, re the issue at [0]...
> 
> [0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/
> 
> Which by the way you... didn't seem to be cc'd on, unless I missed it? I
> had to manually add you in which is... weird.
> 
>>
>> v1 -> v2:
>> * Avoid a second get_pat_info() [and thereby fix the error checking]
>>    by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
>> * Simplify untrack_pfn_copy() by calling untrack_pfn().
>> * Retested
>>
>> Not sure if we want to CC stable ... it's really hard to trigger in
>> sane environments.
> 
> This kind of code path is probably in reality... theoretical. So I'm fine
> with this.
> 

Thanks a bunch for your review!

>>
>> ---
>>   arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
>>   include/linux/pgtable.h   | 28 ++++++++++++++++-----
>>   kernel/fork.c             |  4 +++
>>   mm/memory.c               | 11 +++------
>>   4 files changed, 58 insertions(+), 37 deletions(-)
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index feb8cc6a12bf2..d721cc19addbd 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
>>   	return -EINVAL;
>>   }
>>
>> -/*
>> - * track_pfn_copy is called when vma that is covering the pfnmap gets
>> - * copied through copy_page_range().
>> - *
>> - * If the vma has a linear pfn mapping for the entire range, we get the prot
>> - * from pte and reserve the entire vma range with single reserve_pfn_range call.
>> - */
>> -int track_pfn_copy(struct vm_area_struct *vma)
>> +int track_pfn_copy(struct vm_area_struct *dst_vma,
>> +		struct vm_area_struct *src_vma, unsigned long *pfn)
> 
> I think we need an additional 'tracked' parameter so we know whether or not
> this pfn is valid.

See below.

> 
> It's kind of icky... see the bot report for context, but we we sort of need
> to differentiate between 'error' and 'nothing to do'. Of course PFN can
> conceivably be 0 so we can't just return that or an error (plus return
> values that can be both errors and values are fraught anyway).
> 
> So I guess -maybe- least horrid thing is:
> 
> int track_pfn_copy(struct vm_area_struct *dst_vma,
> 		struct vm_area_struct *src_vma, unsigned long *pfn,
> 		bool *pfn_tracked);
> 
> Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
> do an if (pfn_tracked) untrack_pfn_copy(...).
> 
> I'm really not in favour of just initialising PFN to 0 because there are
> code paths where this might actually get passed around and used
> incorrectly.
> 
> But on the other hand - I get that this is disgusting.

I'm in favor of letting VM_PAT take care of that. Observe how 
untrack_pfn_copy() -> untrack_pfn() takes care of that by checking for 
VM_PAT.

So this should be working as expected? No need to add something on top 
that makes it even more ugly in the caller.

> 
> 
>>   {
>> +	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
>>   	resource_size_t paddr;
>> -	unsigned long vma_size = vma->vm_end - vma->vm_start;
>>   	pgprot_t pgprot;
>> +	int rc;
>>
>> -	if (vma->vm_flags & VM_PAT) {
>> -		if (get_pat_info(vma, &paddr, &pgprot))
>> -			return -EINVAL;
>> -		/* reserve the whole chunk covered by vma. */
>> -		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
>> -	}
>> +	if (!(src_vma->vm_flags & VM_PAT))
>> +		return 0;
> 
> I do always like the use of the guard clause pattern :)
> 
> But here we have a case where now error = 0, pfn not set, and we will try
> to untrack it despite !VM_PAT.

Right, and untrack_pfn() is smart enough to filter that out. (just like 
for any other invokation)

> 
>> +
>> +	/*
>> +	 * Duplicate the PAT information for the dst VMA based on the src
>> +	 * VMA.
>> +	 */
>> +	if (get_pat_info(src_vma, &paddr, &pgprot))
>> +		return -EINVAL;
>> +	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
>> +	if (rc)
>> +		return rc;
> 
> I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
> to use ret in both places.
> 
> But also feel free to ignore this.

"int retval;" ? ;)

> 
>>
>> +	/* Reservation for the destination VMA succeeded. */
>> +	vm_flags_set(dst_vma, VM_PAT);
>> +	*pfn = PHYS_PFN(paddr);
>>   	return 0;
>>   }
>>
>> +void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
>> +{
>> +	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
>> +	/*
>> +	 * Reservation was freed, any copied page tables will get cleaned
>> +	 * up later, but without getting PAT involved again.
>> +	 */
>> +}
>> +
>>   /*
>>    * prot is passed in as a parameter for the new mapping. If the vma has
>>    * a linear pfn mapping for the entire range, or no vma is provided,
>> @@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>>   	}
>>   }
>>
>> -/*
>> - * untrack_pfn_clear is called if the following situation fits:
>> - *
>> - * 1) while mremapping a pfnmap for a new region,  with the old vma after
>> - * its pfnmap page table has been removed.  The new vma has a new pfnmap
>> - * to the same pfn & cache type with VM_PAT set.
>> - * 2) while duplicating vm area, the new vma fails to copy the pgtable from
>> - * old vma.
>> - */
> 
 > This just wrong now?

Note that I'm keeping the doc to a single place -- the stub in the 
header. (below)

Or can you elaborate what exactly is "wrong"?

> 
>>   void untrack_pfn_clear(struct vm_area_struct *vma)
>>   {
>>   	vm_flags_clear(vma, VM_PAT);
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 94d267d02372e..4c107e17c547e 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>>   }
>>
>>   /*
>> - * track_pfn_copy is called when vma that is covering the pfnmap gets
>> - * copied through copy_page_range().
>> + * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
>> + * tables copied during copy_page_range(). On success, stores the pfn to be
>> + * passed to untrack_pfn_copy().
>>    */
>> -static inline int track_pfn_copy(struct vm_area_struct *vma)
>> +static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
>> +		struct vm_area_struct *src_vma, unsigned long *pfn)
>>   {
>>   	return 0;
>>   }
>>
>> +/*
>> + * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
>> + * copy_page_range(), but after track_pfn_copy() was already called.
>> + */
> 
> Do we really care to put a comment like this on a stub function?

Whoever started this beautiful VM_PAT code decided to do it like that: 
and I think it's better kept at a single location.

> 
>> +static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> +		unsigned long pfn)
>> +{
>> +}
>> +
>>   /*
>>    * untrack_pfn is called while unmapping a pfnmap for a region.
>>    * untrack can be called for a specific region indicated by pfn and size or
>> @@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
>>   }
>>
>>   /*
>> - * untrack_pfn_clear is called while mremapping a pfnmap for a new region
>> - * or fails to copy pgtable during duplicate vm area.
>> + * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
>> + *
>> + * 1) During mremap() on the src VMA after the page tables were moved.
>> + * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
>>    */
> 
> Can I say as an aside that I hate this kind of hook? Like quite a lot?
> 
> I mean I've been looking at mremap() of anon mappings as you know obv. but
> the thought of PFN mapping mremap()ing is kind of also a bit ugh.

I absolutely hate all of that, but I'll have to leave any cleanups to 
people with more spare time ;)

> 
>>   static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>>   {
>> @@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>   			   unsigned long size);
>>   extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>>   			     pfn_t pfn);
>> -extern int track_pfn_copy(struct vm_area_struct *vma);
>> +extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>> +		struct vm_area_struct *src_vma, unsigned long *pfn);
>> +extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> +		unsigned long pfn);
>>   extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
>>   			unsigned long size, bool mm_wr_locked);
>>   extern void untrack_pfn_clear(struct vm_area_struct *vma);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 735405a9c5f32..ca2ca3884f763 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>>   	vma_numab_state_init(new);
>>   	dup_anon_vma_name(orig, new);
>>
>> +	/* track_pfn_copy() will later take care of copying internal state. */
>> +	if (unlikely(new->vm_flags & VM_PFNMAP))
>> +		untrack_pfn_clear(new);
> 
> OK so maybe I'm being stupid here, but - is it the case that
> 
> a. We duplicate a VMA that has a PAT-tracked PFN map
 > b. We must clear any existing tracking so everything is 'reset' to 
zero> c. track_pfn_copy() will later in fork process set anything up we 
need here.
> 
> Is this correct?

Right. But b) is actually not "clearing any tracking" (because there is 
no tracking/reservation for the copied version yet) but marking it as 
"not tracked".

> 
>> +
>>   	return new;
>>   }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb7b8dc751679..dc8efa1358e94 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1362,12 +1362,12 @@ int
>>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   {
>>   	pgd_t *src_pgd, *dst_pgd;
>> -	unsigned long next;
>>   	unsigned long addr = src_vma->vm_start;
>>   	unsigned long end = src_vma->vm_end;
>>   	struct mm_struct *dst_mm = dst_vma->vm_mm;
>>   	struct mm_struct *src_mm = src_vma->vm_mm;
>>   	struct mmu_notifier_range range;
>> +	unsigned long next, pfn;
>>   	bool is_cow;
>>   	int ret;
>>
>> @@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>>
>>   	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
>> -		/*
>> -		 * We do not free on error cases below as remove_vma
>> -		 * gets called on error from higher level routine
>> -		 */
>> -		ret = track_pfn_copy(src_vma);
>> +		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
>>   		if (ret)
>>   			return ret;
>>   	}
>> @@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   			continue;
>>   		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
>>   					    addr, next))) {
>> -			untrack_pfn_clear(dst_vma);
>>   			ret = -ENOMEM;
>>   			break;
>>   		}
>> @@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   		raw_write_seqcount_end(&src_mm->write_protect_seq);
>>   		mmu_notifier_invalidate_range_end(&range);
>>   	}
>> +	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
>> +		untrack_pfn_copy(dst_vma, pfn);
> 
> Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
> _only_ way we can not have a valid pfn.
> 
> Do we still want to untrack_pfn_copy() even if !VM_PAT?

Sure, let that be handled internally, where all the ugly VM_PAT handling 
resides.

Unless there is very good reason to do it differently.

Thanks!

-- 
Cheers,

David / dhildenb



  reply	other threads:[~2025-04-02 12:20 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-25 19:19 David Hildenbrand
2025-04-02 11:36 ` David Hildenbrand
2025-04-02 12:32   ` Lorenzo Stoakes
2025-04-03 14:47     ` David Hildenbrand
2025-04-03 14:50       ` Lorenzo Stoakes
2025-04-02 11:59 ` Lorenzo Stoakes
2025-04-02 12:20   ` David Hildenbrand [this message]
2025-04-02 12:31     ` Lorenzo Stoakes
2025-04-02 15:19       ` David Hildenbrand
2025-04-02 11:37 Lorenzo Stoakes
2025-04-03 15:14 ` [PATCH v3] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range() Dan Carpenter
2025-04-03 20:59   ` David Hildenbrand
2025-04-04 11:52     ` Lorenzo Stoakes
2025-04-04 12:20       ` David Hildenbrand
2025-04-04 12:27         ` David Hildenbrand
2025-04-06 17:17           ` Ingo Molnar
2025-04-07  7:11     ` Dan Carpenter
     [not found] <202503270941.IFILyNCX-lkp@intel.com>
     [not found] ` <9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local>
     [not found]   ` <b21bcd61-faf0-4ad8-b644-99794794594f@redhat.com>
2025-04-02 11:40     ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f1544453-7bab-4e65-a6f9-d93aaedb8314@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=dan.carpenter@linaro.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=fleischermarius@gmail.com \
    --cc=hpa@zytor.com \
    --cc=liam.howlett@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=luto@kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wang1315768607@163.com \
    --cc=x86@kernel.org \
    --cc=xrivendell7@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox